The split-\(\widehat{R}\) statistic and the effective sample size (abbreviated as ESS or \(S_{\rm eff}\); previously called \(N_{\rm eff}\) or \(n_{\rm eff}\)) are routinely used to monitor the convergence of iterative simulations, which are omnipresent in Bayesian statistics in the form of Markov chain Monte Carlo samples. The original \(\widehat{R}\) statistic (Gelman and Rubin, 1992; Brooks and Gelman, 1998) and split-\(\widehat{R}\) (Gelman et al., 2013) are both based on the ratio of between- and within-chain marginal variances of the simulations, with the latter computed from chains that have been split in half (hence the name).
\(\widehat{R}\), split-\(\widehat{R}\), and \(S_{\rm eff}\) are well defined only if the marginal distributions have finite mean and variance. Even when that is the case, their estimates are less stable for distributions with long tails. To alleviate these problems, we define split-\(\widehat{R}\) and \(S_{\rm eff}\) using rank normalized values, empirical cumulative distribution functions, and small posterior intervals.
The code for the new split-\(\widehat{R}\) and \(S_{\rm eff}\) versions and for the corresponding diagnostic plots can be found in monitornew.R and monitorplot.R, respectively.
In this section, we will review the split-\(\widehat{R}\) and effective sample size estimates as implemented in Stan 2.18 (Stan Development Team, 2018e). These implementations represent the current de facto standard of convergence diagnostics for iterative simulations.
Below, we present the computation of split-\(\widehat{R}\) following Gelman et al. (2013), but using the notation style of Stan Development Team (2018a). Our general recommendation is to always run several chains. Let \(N\) be the number of draws per chain, \(M\) the number of chains, and \(S=MN\) the total number of draws from all chains. For each scalar summary of interest \(\theta\), we compute \(B\) and \(W\), the between- and within-chain variances:
\[ B = \frac{N}{M-1}\sum_{m=1}^{M}(\overline{\theta}^{(.m)}-\overline{\theta}^{(..)})^2, \;\mbox{ where }\;\;\overline{\theta}^{(.m)}=\frac{1}{N}\sum_{n=1}^N \theta^{(nm)},\;\; \;\;\overline{\theta}^{(..)} = \frac{1}{M}\sum_{m=1}^M\overline{\theta}^{(.m)} \\ W = \frac{1}{M}\sum_{m=1}^{M}s_m^2, \;\mbox{ where }\;\; s_m^2=\frac{1}{N-1} \sum_{n=1}^N (\theta^{(nm)}-\overline{\theta}^{(.m)})^2. \]
The between-chain variance, \(B\), also contains the factor \(N\) because it is based on the variance of the within-chain means, \(\overline{\theta}^{(.m)}\), each of which is an average of \(N\) values \(\theta^{(nm)}\).
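As a concrete illustration (a minimal sketch of the formulas above, not the monitornew.R implementation), \(B\) and \(W\) can be computed from an \(N \times M\) matrix of draws, one column per chain:

```r
# Between- and within-chain variances for an N x M matrix of draws
# (one column per chain). A sketch, not the monitornew.R implementation.
chain_variances <- function(theta) {
  N <- nrow(theta)
  chain_means <- colMeans(theta)     # theta-bar^(.m)
  B <- N * var(chain_means)          # var() already uses the 1/(M-1) factor
  W <- mean(apply(theta, 2, var))    # average of the s_m^2 across chains
  list(B = B, W = W)
}
```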
We can estimate \(\mbox{var}(\theta \mid y)\), the marginal posterior variance of the estimand, by a weighted average of \(W\) and \(B\), namely \[ \widehat{\mbox{var}}^+(\theta \mid y)=\frac{N-1}{N}W + \frac{1}{N}B. \] This quantity overestimates the marginal posterior variance assuming the starting distribution of the simulations is appropriately overdispersed compared to the target distribution, but is unbiased under stationarity (that is, if the starting distribution equals the target distribution), or in the limit \(N\rightarrow\infty\). To have an overdispersed starting distribution, independent Markov chains should be initialized with diffuse starting values for the parameters.
Meanwhile, for any finite \(N\), the within-chain variance \(W\) should underestimate \(\mbox{var}(\theta \mid y)\) because the individual chains haven’t had the time to explore all of the target distribution and, as a result, will have less variability. In the limit as \(N\rightarrow\infty\), the expectation of \(W\) also approaches \(\mbox{var}(\theta \mid y)\).
We monitor convergence of the iterative simulations to the target distribution by estimating the factor by which the scale of the current distribution for \(\theta\) might be reduced if the simulations were continued in the limit \(N\rightarrow\infty\). This potential scale reduction is estimated as \[ \widehat{R}= \sqrt{\frac{\widehat{\mbox{var}}^+(\theta \mid y)}{W}}, \] which declines to 1 as \(N\rightarrow\infty\). We call this split-\(\widehat{R}\) because we are applying it to chains that have been split in half so that \(M\) is twice the number of actual chains. Without splitting, \(\widehat{R}\) would get fooled by non-stationary chains (see Appendix D).
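Under these definitions, split-\(\widehat{R}\) can be sketched in a few lines of R (an illustration under the assumptions above, not the monitornew.R code): each chain is split in half, and the variance ratio is computed over the \(2M\) resulting sequences.

```r
# Split-Rhat sketch: split each of the M chains in half, then compare
# between- and within-chain variances over the 2M split chains.
split_rhat <- function(theta) {           # theta: N x M matrix of draws
  N <- nrow(theta)
  half <- floor(N / 2)
  # first and last halves side by side; drops the middle draw if N is odd
  split <- cbind(theta[1:half, , drop = FALSE],
                 theta[(N - half + 1):N, , drop = FALSE])
  n <- nrow(split)
  B <- n * var(colMeans(split))
  W <- mean(apply(split, 2, var))
  var_plus <- (n - 1) / n * W + B / n     # overestimate of var(theta | y)
  sqrt(var_plus / W)
}
```

For well-mixed stationary chains the value is close to 1; chains with trends or different locations push it above 1.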
We note that split-\(\widehat{R}\) is also well defined for sequences that are not Markov chains. However, for simplicity, we always refer to ‘chains’ rather than more generally to ‘sequences’, as the former is our primary use case for \(\widehat{R}\)-like measures.
If the \(N\) simulation draws within each chain were truly independent, the between-chain variance \(B\) would be an unbiased estimate of the posterior variance, \(\mbox{var}(\theta \mid y)\), and we would have a total of \(S = MN\) independent simulations from the \(M\) chains. In general, however, the simulations of \(\theta\) within each chain will be autocorrelated, and thus \(B\) will be larger than \(\mbox{var}(\theta \mid y)\), in expectation.
A nice introductory reference for analyzing MCMC results in general and effective sample size in particular is provided by Geyer (2011, see also 1992). The particular calculations used by Stan (Stan Development Team, 2018e) follow those for split-\(\widehat{R}\), which involve both between-chain (mean) and within-chain calculations (autocorrelation). They were introduced in the Stan manual (Stan Development Team, 2018d) and explained in more detail in Gelman et al. (2013).
One way to define effective sample size for correlated simulation draws is to consider the statistical efficiency of the average of the simulations \(\bar{\theta}^{(..)}\) as an estimate of the posterior mean \(\mbox{E}(\theta \mid y)\). This also generalizes to posterior expectations of functions of the parameters, \(\mbox{E}(g(\theta) \mid y)\); we return later to how to estimate the effective sample size of quantiles, which cannot be represented as expectations. For simplicity, in this section we consider the effective sample size for the posterior mean.
The effective sample size of a chain is defined in terms of the autocorrelations within the chain at different lags. The autocorrelation \(\rho_t\) at lag \(t \geq 0\) for a chain with joint probability function \(p(\theta)\) with mean \(\mu\) and variance \(\sigma^2\) is defined to be \[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} (\theta^{(n)} - \mu) (\theta^{(n+t)} - \mu) \, p(\theta) \, d\theta. \] This is just the correlation between draws of the chain offset by \(t\) positions. Because \(\theta^{(n)}\) and \(\theta^{(n+t)}\) have the same marginal distribution in an MCMC setting, multiplying out the difference terms and simplifying yields \[ \rho_t = \frac{1}{\sigma^2} \left( \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta) \, d\theta - \mu^2 \right). \]
The effective sample size of one chain generated by a process with autocorrelations \(\rho_t\) is defined by \[ N_{\rm eff} \ = \ \frac{N}{\sum_{t = -\infty}^{\infty} \rho_t} \ = \ \frac{N}{1 + 2 \sum_{t = 1}^{\infty} \rho_t}. \]
Effective sample size \(N_{\rm eff}\) can be larger than \(N\) in the case of antithetic Markov chains, which have negative autocorrelations at odd lags. The dynamic Hamiltonian Monte Carlo algorithms used in Stan (Hoffman and Gelman, 2014; Betancourt, 2017) can produce \(N_{\rm eff}>N\) for parameters with a close-to-Gaussian posterior (in the unconstrained space) and low dependence on other parameters.
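For intuition, consider a stationary AR(1) process with coefficient \(\varphi\), for which \(\rho_t = \varphi^t\) (a textbook example, not tied to any particular sampler). Summing the geometric series gives \(N_{\rm eff} = N(1-\varphi)/(1+\varphi)\), so a negative \(\varphi\), i.e., antithetic behavior, yields \(N_{\rm eff} > N\):

```r
# Theoretical N_eff for a stationary AR(1) chain: rho_t = phi^t, so
# 1 + 2 * sum_{t>=1} phi^t = (1 + phi) / (1 - phi), giving
# N_eff = N * (1 - phi) / (1 + phi).
neff_ar1 <- function(N, phi) N * (1 - phi) / (1 + phi)

neff_ar1(1000, 0.5)    # positive autocorrelation: about 333 effective draws
neff_ar1(1000, -0.5)   # antithetic chain: 3000 effective draws, N_eff > N
```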
In practice, the probability function in question cannot be tractably integrated and thus neither autocorrelation nor the effective sample size can be calculated. Instead, these quantities must be estimated from the samples themselves. The rest of this section describes an autocorrelation and split-\(\widehat{R}\) based effective sample size estimator, based on multiple split chains. For simplicity, each chain will be assumed to be of the same length \(N\).
Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Geyer (2011) for more details on using the FFT for autocorrelation calculations. The autocorrelation estimates \(\hat{\rho}_{t,m}\) at lag \(t\) from the multiple chains \(m \in (1,\ldots,M)\) are combined with the within-chain variance estimate \(W\) and the multi-chain variance estimate \(\widehat{\mbox{var}}^{+}\) introduced in the previous section to compute the combined autocorrelation at lag \(t\) as \[ \hat{\rho}_t = 1 - \frac{\displaystyle W - \textstyle \frac{1}{M}\sum_{m=1}^M s_m^2\,\hat{\rho}_{t,m}}{\widehat{\mbox{var}}^{+}}. \label{rhohat} \] If the chains have not converged, the variance estimator \(\widehat{\mbox{var}}^{+}\) overestimates the true marginal variance, which leads to an overestimation of the autocorrelation and an underestimation of the effective sample size.
Because the noise in the correlation estimates \(\hat{\rho}_t\) increases as \(t\) increases, a truncated sum of \(\hat{\rho}_t\) is typically used. Negative autocorrelations can happen only at odd lags, and by summing over pairs starting from lag \(t=0\), the paired autocorrelation is guaranteed to be positive, monotone and convex modulo estimator noise (Geyer, 1992, 2011). Stan 2.18 uses Geyer’s initial monotone sequence criterion. The effective sample size of the combined chains is defined as \[ S_{\rm eff} = \frac{N \, M}{\hat{\tau}}, \] where \[ \hat{\tau} = 1 + 2 \sum_{t=1}^{2k+1} \hat{\rho}_t = -1 + 2 \sum_{t'=0}^{k} \hat{P}_{t'}, \] and \(\hat{P}_{t'}=\hat{\rho}_{2t'}+\hat{\rho}_{2t'+1}\). The initial positive sequence estimator is obtained by choosing the largest \(k\) such that \(\hat{P}_{t'}>0\) for all \(t' = 1,\ldots,k\). The initial monotone sequence estimator is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding values so that the estimated sequence becomes monotone.
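The truncation rule can be sketched for a single chain as follows (a simplified sketch: it uses the plain single-chain autocorrelation estimate from `acf` rather than the multi-chain combination above, and the function name is ours):

```r
# Single-chain sketch of Geyer's truncation rule: sum autocorrelations in
# pairs, keep pairs up to the first non-positive one (initial positive
# sequence), then force the kept pairs to be non-increasing (initial
# monotone sequence).
ess_geyer <- function(x) {
  N <- length(x)
  rho <- acf(x, lag.max = N - 1, plot = FALSE)$acf[, 1, 1]  # lags 0..N-1
  K <- floor(length(rho) / 2)
  # paired sums P_{t'} = rho_{2t'} + rho_{2t'+1} (R indices are 1-based)
  P <- rho[2 * seq_len(K) - 1] + rho[2 * seq_len(K)]
  first_nonpos <- which(P <= 0)[1]
  if (!is.na(first_nonpos)) P <- P[seq_len(first_nonpos - 1)]
  P <- cummin(P)                 # enforce monotonicity
  tau <- -1 + 2 * sum(P)
  N / tau
}
```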
The effective sample size \(S_{\rm eff}\) described here is different from similar formulas in the literature in that we use multiple chains and between-chain variance in the computation, which typically gives us more conservative claims (lower values of \(S_{\rm eff}\)) compared to single chain estimates, especially when mixing of the chains is poor. If the chains are not mixing at all (e.g., the posterior is multimodal and the chains are stuck in different modes), then our \(S_{\rm eff}\) is close to the number of chains.
Before version 2.18, Stan used a slightly incorrect initial sequence which implied that \(S_{\rm eff}\) was capped at \(S\), and thus the effective sample size was underestimated for some models. As antithetic behavior (i.e., \(S_{\rm eff} > S\)) is uncommon and its effect is usually small, and as capping at \(S\) is merely conservative, this had a negligible effect on inference. However, it may have led to an underestimation of Stan’s efficiency when compared to other packages performing MCMC sampling.
As split-\(\widehat{R}\) and \(S_{\rm eff}\) are well defined only if the marginal posteriors have finite mean and variance, we next introduce versions of split-\(\widehat{R}\) and \(S_{\rm eff}\) based on rank normalized values, empirical cumulative distribution functions, and small posterior intervals, which are well defined for all distributions and more robust for long-tailed distributions.
Rank normalized split-\(\widehat{R}\) is computed using the equations in Section Split-\(\widehat{R}\) by replacing the original parameter values \(\theta^{(nm)}\) with their corresponding rank normalized values \(z^{(nm)}\).
Rank normalization:
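A standard construction, and the one we assume the monitornew.R implementation follows, is to rank all \(S\) draws jointly over the chains and map the fractional ranks through the normal quantile function; the 3/8 and 1/4 offsets below are a common fractional-rank choice:

```r
# Rank normalization sketch: rank all S = N * M draws jointly, then
# back-transform the fractional ranks with the normal quantile function.
# The 3/8 and 1/4 offsets are an assumed (common) fractional-rank choice.
rank_normalize <- function(theta) {       # theta: N x M matrix of draws
  S <- length(theta)
  r <- rank(theta, ties.method = "average")
  z <- qnorm((r - 3/8) / (S + 1/4))
  matrix(z, nrow(theta), ncol(theta))
}
```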
Appendix B illustrates the rank normalization of multiple chains.
For continuous variables, as \(S \rightarrow \infty\), the rank normalized values are normally distributed; rank normalization thus performs a non-parametric transformation to the normal distribution. Using normalized ranks instead of the ranks directly has the benefit that the behavior of \(\widehat{R}\) and \(S_{\rm eff}\) does not change from that presented in Section Split-\(\widehat{R}\) for normally distributed \(\theta\).
Both the original and the rank-normalized split-\(\widehat{R}\) can be fooled if the chains have different scales but the same location (see Appendix D). To alleviate this problem, we propose to compute a rank normalized folded-split-\(\widehat{R}\) from folded split chains, obtained by rank normalizing the absolute deviations from the median, \[ {\rm abs}(\theta^{(nm)}-{\rm median}(\theta)). \]
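Folding itself is a one-liner; the folded draws are then run through exactly the same rank normalized split-\(\widehat{R}\) computation as before:

```r
# Fold the draws around the overall median; the folded values feed into
# the same rank normalized split-Rhat computation described above.
fold_draws <- function(theta) abs(theta - median(theta))
```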
To obtain a single conservative \(\widehat{R}\) estimate, we propose to report the maximum of rank normalized split-\(\widehat{R}\) and rank normalized folded-split-\(\widehat{R}\) for each parameter.
In addition to using rank-normalized values for convergence diagnostics via \(\widehat{R}\), we can also compute the corresponding effective sample size \(S_{\rm eff}\) using the equations in Section Effective sample size, replacing the parameter values \(\theta^{(nm)}\) with the rank normalized values \(z^{(nm)}\). This estimate is well defined even if the original distribution does not have finite mean and variance. It is not directly applicable for estimating the Monte Carlo error of the mean of the original values, but it provides an estimate of the mixing efficiency of the chains that is invariant to bijective transformations. For parameters with a close-to-normal distribution, the difference from using the original values is small. However, for parameters with a distribution far from normal, rank normalization can be seen as a near-optimal non-parametric transformation.
The effective sample size computed using rank normalized values is a useful measure of the efficiency of estimating the bulk (the mean and quantiles near the median) of the distribution; as a shorthand, we use the term bulk effective sample size (bulk-ESS). Bulk-ESS is also useful for diagnosing problems due to trends or different means of the chains (see Appendix D).
Bulk-ESS, together with the tail-ESS introduced below, is useful as an overall efficiency measure. Next, we introduce effective sample size estimates for the cumulative distribution function (CDF), which we then use to derive efficiency diagnostics for quantiles and for local small probability intervals.
Quantiles, and the posterior intervals derived from them, are commonly reported quantities that are easy to estimate from posterior draws. Estimating the efficiency of such quantile estimates thus has high practical relevance, in particular because we often observe lower efficiency for tail quantiles than for the mean or median. The \(\alpha\)-quantile is defined as the parameter value \(\theta_\alpha\) for which \(p(\theta \leq \theta_\alpha) = \alpha\). An estimate \(\hat{\theta}_\alpha\) of \(\theta_\alpha\) can thus be obtained by finding the \(\alpha\)-quantile of the empirical CDF (ECDF) of the posterior draws \(\theta^{(s)}\). However, quantiles cannot be written as expectations, and thus the above equations for \(\widehat{R}\) and \(S_{\rm eff}\) are not directly applicable. We therefore first focus on the efficiency estimate for the cumulative probability \(p(\theta \leq \theta_\alpha)\) for different values of \(\theta_\alpha\).
For any \(\theta_\alpha\), the ECDF gives an estimate of the cumulative probability \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq\theta_\alpha), \] where \(I()\) is the indicator function. The indicator function transforms the simulation draws to 0’s and 1’s, and thus the subsequent computations are invariant under bijective transformations of \(\theta\). Efficiency estimates of the ECDF at any \(\theta_\alpha\) can now be obtained by applying rank normalization and the subsequent computations directly to the indicator function’s results. See Appendix C for an illustration of the variance of the ECDF.
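The indicator transform can be made concrete in a couple of lines (a sketch; the efficiency step is then whatever ESS routine one already has, applied to the 0/1 sequence). Because the indicator only compares relative positions, it is unchanged under any strictly increasing transformation of the draws:

```r
# Indicator transform for the ECDF at theta_alpha: 1 if a draw is below
# the threshold, 0 otherwise. ESS computations are then applied directly
# to this 0/1 sequence.
ecdf_indicator <- function(theta, theta_alpha) as.numeric(theta <= theta_alpha)

x <- c(-1.2, 0.3, 0.8, 2.1)
mean(ecdf_indicator(x, 0.5))    # ECDF estimate of p(theta <= 0.5), here 0.5
```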
Assuming that we know the CDF to be a certain continuous function \(F\) which is smooth near an \(\alpha\)-quantile of interest, we could use the delta method to compute a variance estimate for \(F^{-1}(\bar{I}_\alpha)\). Although we don’t usually know \(F\), the delta method reveals that the variance of the quantile estimate \(F^{-1}(\bar{I}_\alpha)\) is the variance of \(\bar{I}_\alpha\) scaled by the (usually unknown) constant \(1/f(\theta_\alpha)^2\), so that the relative efficiency depends only on the efficiency of \(\bar{I}_\alpha\). Thus, we can use the effective sample size computed for the ECDF (via the indicator function \(I(\theta^{(s)} \leq \theta_\alpha)\)) also for the corresponding quantile estimates.
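In symbols, the delta-method step reads (a one-line sketch, using the notation above):
\[
\mbox{var}\!\left(F^{-1}(\bar{I}_\alpha)\right)
\approx \left(\frac{d\,F^{-1}(\alpha)}{d\alpha}\right)^{\!2} \mbox{var}(\bar{I}_\alpha)
= \frac{\mbox{var}(\bar{I}_\alpha)}{f(\theta_\alpha)^2},
\]
so the unknown density \(f(\theta_\alpha)\) enters only as a constant scale factor, and the relative efficiency, and hence the effective sample size, of the quantile estimate equals that of \(\bar{I}_\alpha\).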
In order to summarize the efficiency in the distribution’s tails, we propose to compute the minimum of the effective sample sizes of the 5% and 95% quantiles. As a shorthand, we use the term tail effective sample size (tail-ESS). Tail-ESS is also useful for diagnosing problems due to different scales of the chains (see Appendix D).
Since the marginal posterior distributions might not have finite mean and variance, RStan (Stan Development Team, 2018c) and RStanARM (Stan Development Team, 2018b) by default report the median and the median absolute deviation (MAD) instead of the mean and standard error (SE). The median and MAD are well defined even when the marginal distribution does not have finite mean and variance. Since the median is just the 50% quantile, we can get an efficiency estimate for it as for any other quantile.
We can also compute the efficiency estimate for the median absolute deviation (MAD) by computing the efficiency estimate for the median of the absolute deviations from the median of all draws. These absolute deviations are the same folded values as defined previously, \[ {\rm abs}(\theta^{(nm)}-{\rm median}(\theta)). \] Thus, the efficiency estimate for MAD is obtained with the same approach as for the median (and other quantiles), but using the folded values as in rank-normalized-folded-split-\(S_{\rm eff}\).
Previously, Stan has reported Monte Carlo standard error estimates for the mean of a quantity. This is valid only if the corresponding marginal distribution has finite mean and variance; and even when valid, it may be easier and more robust to focus on the median and other quantiles instead.
Median, MAD and quantiles are well defined even when the distribution does not have finite mean and variance, and they are asymptotically normal for continuous distributions which are non-zero in the relevant neighborhood. As the delta method for computing the variance would require explicit knowledge of the normalized posterior density, we propose the following alternative approach:
We can get more local efficiency estimates by considering small probability intervals. We propose to compute the efficiency estimates for \[ \bar{I}_{\alpha,\delta} = p(\hat{Q}_\alpha < \theta \leq \hat{Q}_{\alpha+\delta}), \] where \(\hat{Q}_\alpha\) is an empirical \(\alpha\)-quantile, \(\delta=1/k\) is the length of the interval with some positive integer \(k\), and \(\alpha \in (0,\delta,\ldots,1-\delta)\) changes in steps of \(\delta\). Each interval has \(S/k\) draws, and the efficiency measures autocorrelation of an indicator function which is \(1\) when the values are inside the specific interval and \(0\) otherwise. This gives us a local efficiency measure which does not depend on the shape of the distribution.
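As a sketch of this construction in R (using R’s default empirical quantile for \(\hat{Q}_\alpha\); the function name is ours):

```r
# 0/1 indicator for the small probability interval (Q_alpha, Q_{alpha+delta}]:
# the local efficiency is the ESS of this indicator sequence.
interval_indicator <- function(theta, alpha, delta) {
  q <- quantile(theta, probs = c(alpha, alpha + delta), names = FALSE)
  as.numeric(theta > q[1] & theta <= q[2])
}
```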
In addition to using rank-normalized values to compute split-\(\widehat{R}\), we propose to use rank plots for each chain instead of trace plots. Rank plots are simply histograms of the ranked posterior draws (ranked over all chains) plotted separately for each chain. If the rank plots of all chains look similar, this indicates good mixing of the chains. Unlike trace plots, rank plots do not become cluttered when chains are long.
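A minimal base-R sketch of such a rank plot (bayesplot provides a polished version; this function and its layout are our own illustration):

```r
# Rank plot sketch: rank the draws over all chains, then draw one
# histogram of ranks per chain. Near-uniform histograms indicate good
# mixing. Returns the rank matrix invisibly.
rank_plot <- function(theta) {            # theta: N x M matrix of draws
  r <- matrix(rank(theta), nrow(theta), ncol(theta))
  old_par <- par(mfrow = c(1, ncol(theta)))
  on.exit(par(old_par))
  for (m in seq_len(ncol(theta)))
    hist(r[, m], breaks = 20, main = paste("chain", m), xlab = "rank")
  invisible(r)
}
```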
The proposal is to switch in Stan:
Justifications for the changes are:
In summary outputs, we propose to use Rhat to denote the new version as well. As \(n\) and \(N\) often refer to the number of observations, we propose the acronym ESS for the effective sample size.
We propose to add to the bayesplot package:
Based on the experiments presented in Appendices D-F, stricter convergence diagnostic and effective sample size warning limits could be used. We propose the following warning thresholds, although additional experiments would be useful: Rhat > 1.01 and ESS < 400.
When running 4 chains, an effective sample size of 400 corresponds to an effective sample size of 50 for each of the 8 split chains, which we consider to be the minimum for the reliable mean, variance, and autocorrelation estimates needed for the convergence diagnostics. We recommend running at least 4 chains to obtain reliable between-chain variance estimates for the convergence diagnostics.
Plots shown in the upcoming sections have dashed lines based on these thresholds (in continuous plots, a dashed line is drawn at 1.005 instead of 1.01, as values larger than that are usually rounded to 1.01 in our summaries).
In this section, we will go through some examples to demonstrate the usefulness of our proposed methods as well as the associated workflow in determining convergence. Appendices D-G contain more detailed analysis of different algorithm variants and further examples.
First, we load all the necessary R packages and additional functions.
library(tidyverse)
library(gridExtra)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(rjags)
library(abind)
source('monitornew.R')
source('monitorplot.R')
The classic split-\(\widehat{R}\) is based on comparing within- and between-chain variances. If the marginal distribution of a chain is such that the variance is infinite or undefined, the classic split-\(\widehat{R}\) is not well justified. In this section, we will use the Cauchy distribution as an example of such a distribution. Appendix E contains more detailed analysis of different algorithm variants and further Cauchy examples.
The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution.
The nominal Cauchy model with direct parameterization is as follows:
writeLines(readLines("cauchy_nom.stan"))
parameters {
vector[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the nominal model:
fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1233 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Treedepth exceedance and the Bayesian fraction of missing information are dynamic HMC specific diagnostics (Betancourt, 2017). We get warnings about a very large number of post-warmup transitions that exceeded the maximum treedepth, which is likely due to the very long tails of the Cauchy distribution. All chains also have a low estimated Bayesian fraction of missing information, indicating slow mixing.
mon <- monitor(fit_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] -5.738 -0.015 6.313 2.505 36.377 1.028 1181 393
x[2] -5.828 -0.012 6.067 0.648 16.217 1.007 2645 502
x[3] -5.235 0.009 5.729 0.575 17.723 1.012 2683 823
x[4] -6.250 -0.020 6.897 0.163 11.487 1.010 3627 644
x[5] -9.663 -0.048 5.109 -1.413 11.271 1.009 629 156
x[6] -5.257 -0.003 5.406 0.197 5.318 1.011 3060 883
x[7] -6.354 0.059 10.772 4.144 32.565 1.017 607 184
x[8] -6.445 -0.012 5.375 -0.219 7.662 1.004 2658 886
x[9] -6.531 -0.004 6.295 0.152 7.382 1.006 3128 901
x[10] -6.122 -0.011 5.916 0.056 5.743 1.006 2421 642
x[11] -6.728 0.001 6.150 0.034 9.682 1.008 2079 600
x[12] -5.678 -0.033 4.883 0.193 17.526 1.003 2633 774
x[13] -4.525 -0.055 4.269 0.090 6.283 1.003 3148 811
x[14] -4.880 -0.002 5.028 -0.023 5.931 1.004 1461 492
x[15] -14.490 -0.006 11.514 -1.411 22.739 1.032 486 160
x[16] -7.028 0.013 6.959 0.162 15.913 1.014 2329 463
x[17] -6.585 0.012 7.691 0.964 30.185 1.007 2292 446
x[18] -4.505 0.054 6.915 1.099 11.893 1.015 2640 447
x[19] -7.659 -0.047 6.079 -3.141 28.144 1.007 1147 298
x[20] -5.782 0.026 11.335 5.777 36.989 1.029 363 80
x[21] -4.894 0.022 5.379 0.105 5.453 1.008 3276 824
x[22] -5.594 0.042 5.372 0.493 15.375 1.010 2121 522
x[23] -14.448 0.008 6.995 -3.096 32.020 1.012 391 89
x[24] -5.932 -0.036 5.471 -1.007 16.613 1.015 1434 284
x[25] -7.068 -0.023 5.944 -1.758 21.257 1.009 1544 324
x[26] -8.962 -0.062 5.847 -1.970 17.357 1.011 1778 452
x[27] -8.813 -0.006 8.345 0.153 18.079 1.005 1816 352
x[28] -5.256 0.024 5.927 0.105 9.511 1.017 3776 675
x[29] -5.881 -0.002 5.895 -0.036 18.470 1.009 3642 846
x[30] -5.572 -0.016 5.141 -0.185 6.221 1.001 4363 643
x[31] -7.360 0.014 7.255 0.069 8.715 1.002 3384 896
x[32] -8.850 -0.038 6.497 -10.220 81.180 1.012 561 141
x[33] -4.787 0.031 4.959 -0.348 8.611 1.007 2626 735
x[34] -5.906 -0.034 5.372 -0.108 5.956 1.006 2408 634
x[35] -6.103 0.017 6.790 -0.328 14.648 1.010 2654 630
x[36] -4.861 0.097 15.598 19.301 108.424 1.044 155 34
x[37] -8.932 -0.001 7.856 -0.090 13.300 1.006 1155 427
x[38] -5.543 -0.023 4.830 -0.327 5.950 1.003 3353 576
x[39] -5.895 -0.027 5.836 0.107 7.294 1.016 4493 724
x[40] -8.707 0.012 6.715 -1.978 19.806 1.003 877 148
x[41] -6.884 -0.027 7.548 -0.270 11.284 1.003 1726 512
x[42] -6.155 0.014 6.480 0.091 15.506 1.014 1519 448
x[43] -6.385 0.000 7.392 0.113 12.102 1.009 2746 483
x[44] -7.843 0.005 7.992 0.414 11.498 1.008 2999 659
x[45] -4.766 -0.020 4.662 0.036 6.726 1.007 2904 1014
x[46] -4.840 0.004 5.986 1.161 22.071 1.004 955 338
x[47] -8.507 0.034 24.263 0.796 37.140 1.071 231 56
x[48] -6.734 0.001 5.331 -0.717 10.349 1.008 1907 469
x[49] -6.272 0.044 6.919 0.730 12.441 1.004 1490 390
x[50] -5.211 0.001 4.645 -0.221 7.078 1.001 3109 680
I 0.000 0.500 1.000 0.500 0.500 1.000 390 4000
lp__ -92.706 -69.208 -49.032 -69.536 13.349 1.054 117 323
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[1:50, 'Tail_ESS'])
Several Rhat values above 1.01 and several ESS values below 400 also indicate that the results should not be trusted. Appendix E has more results with longer chains as well.
We can further analyze potential problems using local efficiency and rank plots. We specifically investigate x[36], which has the smallest tail-ESS of 34.
We examine the sampling efficiency in different parts of the posterior by computing the efficiency of small interval probability estimates (see Section Efficiency estimate for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each if \(k=20\)). The small interval efficiency measures the mixing of an indicator function which is 1 when a value falls inside the specific small interval and 0 otherwise. As detailed above, this gives us a local efficiency measure which does not depend on the shape of the distribution.
plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)
We see that the efficiency of our posterior draws is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show iterations that exceeded the maximum treedepth.
An alternative way to examine the efficiency in different parts of the posterior is to compute efficiency estimates for quantiles (see Section Efficiency for quantiles). Each interval has a specified proportion of the draws, and the efficiency measures the mixing of an indicator function which is 1 when a value is smaller than or equal to the corresponding quantile.
plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)
Similarly to the above, we see that the efficiency of our posterior draws is worryingly low in the tails. Again, orange ticks show iterations that exceeded the maximum treedepth.
We may also investigate how the estimated effective sample sizes change as we use more and more draws (Brooks and Gelman (1998) proposed a similar graph for \(\widehat{R}\)). If the effective sample size is highly unstable, does not increase proportionally with more draws, or even decreases, this indicates that simply running longer chains will likely not solve the convergence issues. In the plot below, we see how unstable both bulk-ESS and tail-ESS are for this example.
plot_change_ess(fit = fit_nom, par = which_min_ess)
We can further analyze potential problems using rank plots in which we clearly see differences between chains.
samp <- as.array(fit_nom)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Next we examine an alternative parameterization that represents the Cauchy distribution as a scale mixture of Gaussians. The model has two parameter vectors, from which the Cauchy distributed \(x\)’s are computed. In addition to improved sampling performance, the example illustrates why careful convergence diagnostics matter.
writeLines(readLines("cauchy_alt_1.stan"))
parameters {
vector[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a ./ sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the alternative model:
fit_alt1 <- stan(file = 'cauchy_alt_1.stan', seed = 7878, refresh = 0)
There are no warnings, and the sampling is much faster.
mon <- monitor(fit_alt1)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x_a[1] -1.648 -0.020 1.657 -0.015 0.992 1.001 4361 2961
x_a[2] -1.666 -0.002 1.631 -0.012 1.008 1.001 4015 3167
x_a[3] -1.662 -0.026 1.617 -0.010 0.997 1.001 4190 3013
x_a[4] -1.625 -0.011 1.691 0.006 1.009 1.000 3626 2822
x_a[5] -1.636 -0.007 1.578 -0.011 0.985 1.002 3447 3029
x_a[6] -1.623 -0.009 1.558 -0.011 0.971 1.001 3975 2795
x_a[7] -1.607 0.003 1.653 0.011 1.008 1.000 3919 2271
x_a[8] -1.672 -0.028 1.604 -0.024 1.005 1.001 3736 3040
x_a[9] -1.626 -0.037 1.549 -0.029 0.986 1.002 3852 3075
x_a[10] -1.655 -0.033 1.738 0.000 1.022 1.002 3457 2810
x_a[11] -1.555 -0.016 1.551 -0.013 0.966 1.002 3532 2822
x_a[12] -1.638 0.002 1.690 0.009 1.004 1.001 3462 3036
x_a[13] -1.590 0.012 1.635 0.008 0.989 1.000 3479 2764
x_a[14] -1.718 -0.026 1.625 -0.029 1.013 1.002 3872 3179
x_a[15] -1.650 0.015 1.697 0.023 1.018 1.000 3843 2855
x_a[16] -1.675 -0.006 1.624 -0.012 1.009 1.001 3610 3047
x_a[17] -1.631 -0.008 1.662 -0.007 0.989 1.001 4363 3076
x_a[18] -1.654 -0.010 1.688 -0.008 1.005 1.001 4636 3088
x_a[19] -1.613 -0.020 1.588 -0.023 0.975 1.000 4055 3037
x_a[20] -1.643 -0.006 1.618 0.000 1.001 1.001 4340 2792
x_a[21] -1.671 0.040 1.683 0.025 1.033 1.000 3628 2719
x_a[22] -1.628 -0.013 1.645 -0.010 1.009 1.000 3976 2775
x_a[23] -1.617 0.034 1.698 0.018 1.013 1.002 4266 2682
x_a[24] -1.662 0.002 1.667 -0.001 1.000 1.002 3710 2889
x_a[25] -1.636 0.016 1.663 0.004 0.996 1.001 4408 3062
x_a[26] -1.647 -0.017 1.613 -0.005 1.003 1.000 3854 2674
x_a[27] -1.663 -0.001 1.680 0.001 1.013 1.002 3331 2944
x_a[28] -1.687 0.018 1.681 0.016 1.014 1.002 4055 2689
x_a[29] -1.575 0.002 1.642 0.018 0.990 1.000 4097 3213
x_a[30] -1.609 -0.005 1.644 0.011 0.988 1.000 3961 2900
x_a[31] -1.648 0.013 1.605 -0.005 0.994 1.000 4097 3088
x_a[32] -1.588 -0.006 1.590 -0.010 0.969 1.000 4189 3171
x_a[33] -1.688 -0.010 1.770 0.004 1.049 1.001 3853 2594
x_a[34] -1.684 0.009 1.691 -0.004 1.006 1.000 4012 2787
x_a[35] -1.613 0.026 1.675 0.027 1.000 1.001 3920 2829
x_a[36] -1.667 0.004 1.604 -0.007 0.988 1.001 4107 2797
x_a[37] -1.687 -0.027 1.593 -0.018 1.000 1.002 3956 2942
x_a[38] -1.666 -0.018 1.638 -0.016 1.018 1.002 4210 2919
x_a[39] -1.655 -0.025 1.647 -0.016 1.016 1.000 4482 2869
x_a[40] -1.629 0.003 1.618 0.000 0.999 1.003 4139 2848
x_a[41] -1.684 0.016 1.614 -0.003 0.990 1.004 3963 2962
x_a[42] -1.645 0.028 1.636 0.020 1.001 1.001 4161 3077
x_a[43] -1.638 -0.030 1.633 -0.015 1.008 1.000 3808 2849
x_a[44] -1.655 -0.033 1.630 -0.017 1.004 1.001 3743 2865
x_a[45] -1.682 0.019 1.731 0.009 1.017 1.001 3783 2546
x_a[46] -1.560 0.044 1.664 0.049 0.970 1.000 4340 3277
x_a[47] -1.660 0.018 1.583 -0.004 0.996 1.000 4283 2946
x_a[48] -1.607 0.003 1.656 0.002 0.981 1.000 4001 2779
x_a[49] -1.617 0.002 1.641 0.003 0.998 1.000 3906 3099
x_a[50] -1.625 -0.006 1.583 0.000 0.987 1.001 3794 2962
x_b[1] 0.003 0.445 3.997 1.002 1.443 1.004 2268 1264
x_b[2] 0.004 0.456 3.752 0.996 1.380 1.002 2444 1428
x_b[3] 0.004 0.467 3.892 1.035 1.457 1.000 3578 1950
x_b[4] 0.004 0.456 3.830 1.010 1.406 1.001 2693 1342
x_b[5] 0.004 0.452 3.951 1.026 1.463 1.000 3056 1731
x_b[6] 0.004 0.440 3.803 1.001 1.412 1.001 3264 1786
x_b[7] 0.005 0.432 3.618 0.949 1.324 1.000 2888 1907
x_b[8] 0.004 0.443 3.792 1.008 1.403 1.000 2876 1578
x_b[9] 0.005 0.496 3.674 0.998 1.364 1.000 2820 1535
x_b[10] 0.003 0.441 3.789 0.987 1.392 1.000 2534 1699
x_b[11] 0.006 0.495 3.900 1.028 1.438 1.001 3595 2032
x_b[12] 0.004 0.440 3.822 0.987 1.401 1.002 2405 1200
x_b[13] 0.003 0.462 3.973 1.031 1.454 1.000 2045 1073
x_b[14] 0.004 0.441 3.970 1.032 1.486 1.000 2829 1443
x_b[15] 0.005 0.450 3.770 1.011 1.410 1.000 2853 1447
x_b[16] 0.004 0.477 3.794 1.006 1.426 1.000 2661 1604
x_b[17] 0.006 0.459 3.929 1.023 1.425 1.001 2775 1477
x_b[18] 0.006 0.505 4.106 1.065 1.534 1.001 2689 1170
x_b[19] 0.004 0.450 3.918 1.001 1.408 1.000 2392 1450
x_b[20] 0.003 0.425 3.964 0.985 1.423 1.003 2296 1240
x_b[21] 0.005 0.488 3.939 1.037 1.431 1.000 3069 1848
x_b[22] 0.004 0.451 3.946 1.016 1.455 1.000 3012 1733
x_b[23] 0.004 0.459 3.945 1.005 1.420 1.002 1787 1093
x_b[24] 0.003 0.436 3.862 1.002 1.428 1.000 1903 1008
x_b[25] 0.004 0.452 3.661 0.979 1.390 1.000 2348 1094
x_b[26] 0.005 0.472 4.046 1.029 1.457 1.001 2421 1549
x_b[27] 0.004 0.453 3.900 1.011 1.412 1.000 2777 1470
x_b[28] 0.004 0.456 3.792 0.984 1.368 1.000 3353 1699
x_b[29] 0.006 0.458 3.871 1.014 1.430 1.000 3428 1997
x_b[30] 0.004 0.437 3.885 1.009 1.430 1.000 2833 1554
x_b[31] 0.004 0.487 3.837 1.015 1.408 1.001 3035 1633
x_b[32] 0.003 0.432 3.753 0.966 1.359 1.001 2276 1602
x_b[33] 0.004 0.452 3.789 1.000 1.408 1.000 3093 1888
x_b[34] 0.004 0.468 3.970 1.026 1.452 1.000 3309 1650
x_b[35] 0.004 0.482 3.841 1.015 1.424 1.002 2493 1588
x_b[36] 0.005 0.443 3.894 0.988 1.395 1.000 3108 1876
x_b[37] 0.004 0.459 3.702 0.979 1.352 1.000 2644 1322
x_b[38] 0.005 0.454 3.930 1.012 1.449 1.001 3155 1776
x_b[39] 0.004 0.450 3.752 0.987 1.417 1.001 2038 934
x_b[40] 0.003 0.419 3.761 0.938 1.313 1.000 2657 1403
x_b[41] 0.005 0.461 3.784 0.998 1.379 1.000 2648 1370
x_b[42] 0.002 0.450 3.894 1.001 1.426 1.001 2334 1365
x_b[43] 0.004 0.470 4.030 1.032 1.443 1.000 2967 1797
x_b[44] 0.004 0.430 3.688 0.975 1.367 1.000 2557 1591
x_b[45] 0.005 0.442 3.656 0.962 1.300 1.001 2731 1785
x_b[46] 0.004 0.458 3.745 1.014 1.402 1.001 2538 1183
x_b[47] 0.007 0.472 3.833 1.007 1.392 1.001 3948 2071
x_b[48] 0.006 0.480 3.891 1.020 1.394 1.000 3207 1917
x_b[49] 0.005 0.474 3.737 0.999 1.347 1.001 2550 1533
x_b[50] 0.003 0.460 4.012 1.004 1.479 1.001 2881 1395
x[1] -6.472 -0.019 6.520 0.011 34.878 1.006 3901 2122
x[2] -6.516 -0.002 6.499 3.385 145.630 1.001 3767 1947
x[3] -6.313 -0.030 5.948 -0.083 16.172 1.001 3681 2514
x[4] -6.764 -0.013 5.864 -1.256 50.733 1.000 3244 2159
x[5] -6.673 -0.012 5.755 -0.313 30.289 1.002 3355 2430
x[6] -5.640 -0.013 6.476 -145.798 6461.971 1.000 3802 2536
x[7] -6.593 0.003 6.375 -0.581 22.645 1.000 3480 2508
x[8] -6.496 -0.035 6.294 -0.037 20.613 1.001 3474 2374
x[9] -5.731 -0.044 5.961 -0.920 55.636 1.002 3493 2262
x[10] -6.369 -0.042 6.430 -0.199 25.217 1.001 2871 2286
x[11] -5.613 -0.020 5.549 0.120 12.361 1.001 3423 2587
x[12] -7.122 0.002 6.187 0.407 91.826 1.003 3340 2329
x[13] -6.444 0.020 6.216 -5.647 201.971 1.001 3332 2164
x[14] -6.716 -0.040 6.564 -2.937 134.969 1.000 3693 2257
x[15] -5.681 0.017 6.059 -0.864 30.387 1.001 3511 1969
x[16] -6.987 -0.011 7.244 -0.402 24.589 1.002 3614 2406
x[17] -5.778 -0.013 5.387 -0.063 25.139 1.001 3880 2520
x[18] -5.450 -0.012 5.915 -0.550 259.421 1.002 4303 2254
x[19] -7.157 -0.024 5.841 9.852 549.143 1.000 3505 1931
x[20] -7.392 -0.008 6.677 -3.089 145.629 1.004 3411 1916
x[21] -5.933 0.049 5.852 0.229 45.618 1.000 3419 2179
x[22] -6.956 -0.019 6.226 -0.196 27.060 1.000 3554 2149
x[23] -5.993 0.045 6.540 2.509 114.592 1.001 3456 2098
x[24] -6.747 0.002 6.991 -0.672 111.425 1.003 3541 1922
x[25] -6.326 0.018 6.398 -2.711 100.245 1.000 4036 2356
x[26] -5.571 -0.021 6.319 0.195 14.470 1.001 3335 2319
x[27] -6.401 -0.002 6.167 -0.784 52.601 1.001 3377 2400
x[28] -5.698 0.023 6.490 3.105 105.247 1.000 3446 2121
x[29] -5.403 0.001 5.672 -0.124 18.134 1.000 3918 2430
x[30] -5.830 -0.005 6.453 0.998 51.195 1.000 3642 2138
x[31] -5.852 0.017 5.660 0.864 26.311 1.002 3595 2271
x[32] -6.487 -0.013 6.184 0.114 50.688 1.001 4227 2731
x[33] -7.079 -0.008 6.339 -0.410 23.461 1.001 3807 2376
x[34] -6.590 0.011 6.051 0.953 48.192 1.001 4084 2358
x[35] -5.438 0.030 6.130 0.318 48.428 1.001 3756 2247
x[36] -6.153 0.004 6.110 -0.057 28.418 1.000 3600 2386
x[37] -6.076 -0.028 5.339 0.700 59.912 1.001 3623 2005
x[38] -5.741 -0.021 5.767 0.200 13.158 1.003 3820 2536
x[39] -6.734 -0.034 5.840 2.770 285.451 1.001 4121 1944
x[40] -6.390 0.006 6.517 -2.748 102.743 1.000 3612 1823
x[41] -6.012 0.022 5.923 -0.421 35.694 1.001 3492 2222
x[42] -7.394 0.045 6.859 0.495 22.471 1.002 3558 1949
x[43] -5.982 -0.034 6.685 14.356 626.203 1.000 3823 2516
x[44] -7.035 -0.039 6.174 1.545 106.262 1.001 3310 2239
x[45] -5.747 0.021 6.227 -0.427 32.277 1.001 3752 2437
x[46] -5.593 0.060 6.331 -0.318 75.995 1.002 3898 1976
x[47] -5.492 0.029 5.426 -0.021 12.082 1.001 3893 2659
x[48] -5.963 0.003 4.850 -0.013 21.020 1.000 3674 2274
x[49] -6.546 0.003 5.248 -1.235 129.152 1.002 3576 2243
x[50] -6.735 -0.008 6.895 -1.061 147.106 1.001 3437 2486
I 0.000 0.000 1.000 0.499 0.500 1.002 2648 4000
lp__ -95.188 -80.991 -68.660 -81.342 8.076 1.002 1310 1928
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[101:150, 'Tail_ESS'])
All Rhat < 1.01 and ESS > 400 indicate that the sampling worked much better with the alternative parameterization; Appendix E has more results for other alternative parameterizations. The x_a and x_b used to form the Cauchy distributed x have stable quantile, mean, and sd values. Since x is Cauchy distributed, only its quantiles are stable, while its mean and sd estimates vary wildly because the true values are not finite.
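This contrast is easy to reproduce with plain Monte Carlo draws (a base-R illustration, independent of the Stan fits above):

```r
# Running quantile estimates of Cauchy draws stabilize, while running mean
# and sd estimates keep jumping because the true values are not finite.
set.seed(1)
x <- rcauchy(1e5)
sizes <- c(1e2, 1e3, 1e4, 1e5)
medians <- sapply(sizes, function(s) median(x[1:s]))  # stable around 0
means   <- sapply(sizes, function(s) mean(x[1:s]))    # no convergence
sds     <- sapply(sizes, function(s) sd(x[1:s]))      # no convergence
```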
We can further analyze potential problems using local efficiency estimates and rank plots. We take a detailed look at x[40], which has the smallest tail-ESS (1823).
We examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates.
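The computation behind these estimates can be sketched in base R (a simplified single-chain version with a naive autocorrelation cutoff; the actual plot_local_ess uses all chains and the estimators from monitornew.R):

```r
# Naive single-chain ESS: truncate the autocorrelation sum at the first
# negative estimate (a crude stand-in for the Geyer-style estimator).
ess_naive <- function(x) {
  S <- length(x)
  rho <- acf(x, lag.max = min(S - 1, 100), plot = FALSE)$acf[-1]
  first_neg <- which(rho < 0)[1]
  if (!is.na(first_neg)) rho <- rho[seq_len(first_neg - 1)]
  S / (1 + 2 * sum(rho))
}

# ESS of the indicator of one small probability interval, i.e. the draws
# falling between two nearby quantiles of the posterior.
local_ess <- function(x, lower, upper) {
  q <- quantile(x, c(lower, upper))
  ess_naive(as.numeric(x > q[1] & x <= q[2]))
}
```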
plot_local_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 20)
The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.
plot_quantile_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 40)
Rank plots also look rather similar across chains.
samp <- as.array(fit_alt1)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
In summary, the alternative parameterization produces results that look much better than for the nominal parameterization. There are still some differences in the tails, which we also identified via the tail-ESS.
Half-Cauchy priors are common and are, for example in Stan, usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan automatically samples in the unconstrained log(x) space, which substantially changes the geometry.
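To see why, consider the density of \(y = \log(x)\) when \(x\) is half-Cauchy(0, 1) (a hand-derived change-of-variables illustration, not Stan's internal code):

```r
# p_y(y) = p_x(exp(y)) * exp(y) = (2 / pi) * exp(y) / (1 + exp(2 * y)),
# so the log density of the transformed variable is:
log_p_y <- function(y) log(2 / pi) + y - log1p(exp(2 * y))
# In y the tails decay exponentially in both directions, a much easier
# geometry for HMC than the polynomial tail of x itself.
```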
writeLines(readLines("half_cauchy_nom.stan"))
parameters {
vector<lower=0>[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the half-Cauchy with nominal parameterization (and positive constraint):
fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the Cauchy nominal model.
mon <- monitor(fit_half_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] 0.081 1.033 13.559 11.119 388.184 1.004 8077 2223
x[2] 0.096 1.017 10.788 11.640 482.364 1.000 9868 2612
x[3] 0.063 1.002 12.855 5.403 71.621 1.000 7895 2097
x[4] 0.094 0.982 11.501 4.405 29.482 1.000 7596 2347
x[5] 0.076 1.034 14.015 4.518 19.853 1.000 7495 2230
x[6] 0.081 0.992 11.716 12.523 379.984 1.000 7948 2145
x[7] 0.073 0.978 12.022 9.227 280.919 1.000 8336 2117
x[8] 0.090 0.988 10.818 4.615 40.828 1.000 8194 2165
x[9] 0.090 1.020 11.555 11.672 483.091 1.001 8464 2127
x[10] 0.072 1.026 14.900 6.007 50.981 1.000 7963 2399
x[11] 0.075 0.973 12.544 15.546 532.285 1.002 6980 1788
x[12] 0.073 1.031 13.331 4.099 17.470 1.001 8226 2029
x[13] 0.077 1.012 14.110 8.881 120.144 1.003 8077 2032
x[14] 0.068 0.960 13.479 6.910 71.522 1.002 7455 2261
x[15] 0.070 0.996 14.426 8.658 95.803 1.003 6766 2246
x[16] 0.075 0.991 14.058 5.554 38.453 1.001 7396 2174
x[17] 0.082 0.986 11.704 8.699 299.762 1.001 7144 2260
x[18] 0.085 0.983 11.561 14.423 397.592 1.001 7614 2035
x[19] 0.073 0.988 14.315 6.334 69.060 1.001 7823 1825
x[20] 0.090 1.001 11.205 6.402 160.214 1.000 7905 2355
x[21] 0.073 0.997 12.011 7.028 144.132 0.999 7843 2078
x[22] 0.087 1.022 12.693 32.183 1733.463 1.000 7735 2043
x[23] 0.070 0.981 13.460 6.325 67.345 1.002 7119 2142
x[24] 0.073 0.996 12.310 4.985 41.550 1.002 6893 1982
x[25] 0.090 1.000 11.527 8.602 270.332 1.002 7757 2351
x[26] 0.085 0.985 10.679 6.968 119.110 1.001 6433 2230
x[27] 0.077 1.009 13.373 4.882 34.296 1.000 7005 2119
x[28] 0.081 0.972 11.410 5.757 66.701 1.002 9631 2228
x[29] 0.074 0.996 13.681 10.374 239.906 1.001 6109 2312
x[30] 0.089 1.025 10.986 7.981 206.413 1.002 7958 2368
x[31] 0.078 0.964 12.417 4.395 32.129 1.001 6493 2102
x[32] 0.097 1.012 11.460 4.987 55.377 1.002 7043 1742
x[33] 0.076 0.987 13.765 5.455 36.802 1.000 6913 2455
x[34] 0.077 0.981 12.378 5.922 78.208 1.002 8610 2514
x[35] 0.075 0.965 13.567 5.696 57.446 1.002 6406 2160
x[36] 0.064 1.003 13.540 5.365 39.600 1.001 7694 2031
x[37] 0.071 1.000 13.867 8.289 195.606 1.001 7276 2491
x[38] 0.083 1.018 12.235 6.399 119.056 1.002 6790 2369
x[39] 0.095 1.013 11.687 6.667 88.506 1.000 7739 2518
x[40] 0.092 0.963 12.061 5.158 44.414 1.000 7087 2349
x[41] 0.070 0.962 12.868 4.969 42.302 1.000 8650 2333
x[42] 0.071 1.020 13.331 7.770 132.343 1.004 8703 2410
x[43] 0.076 0.970 10.254 26.620 1404.418 1.003 8747 2194
x[44] 0.079 1.035 12.290 5.961 59.293 1.003 6378 2257
x[45] 0.080 0.980 12.219 7.703 123.023 1.001 8430 2314
x[46] 0.082 0.990 12.058 5.358 72.186 1.001 8185 2237
x[47] 0.084 1.009 14.500 7.003 76.721 1.000 9562 2265
x[48] 0.076 1.007 13.053 5.063 30.438 1.000 8402 2690
x[49] 0.079 0.999 12.928 7.479 99.983 1.001 7993 1804
x[50] 0.084 0.997 13.215 8.861 178.683 1.002 7523 2243
I 0.000 0.000 1.000 0.487 0.500 0.999 7357 4000
lp__ -80.633 -69.062 -59.314 -69.324 6.415 1.002 1218 2001
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
All Rhat < 1.01 and ESS > 400 indicate good performance of the sampler. We see that Stan's automatic (and implicit) transformation of constrained parameters can have a big effect on sampling performance. More experiments with different parameterizations of the half-Cauchy distribution can be found in Appendix E.
The Eight Schools data is a classic example for hierarchical models (see Section 5.5 in Gelman et al., 2013) which, despite its apparent simplicity, nicely illustrates the typical problems in inference for hierarchical models. The Stan models below are from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences. Appendix F contains a more detailed analysis of different algorithm variants.
writeLines(readLines("eight_schools_cp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
We directly run the centered parameterization model with an increased adapt_delta value to reduce the probability of getting divergent transitions.
eight_schools <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 113 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
Despite the increased adapt_delta, we still observe many divergent transitions, which is by itself a sufficient indicator not to trust the results. We can use Rhat and ESS diagnostics to recognize problematic parts of the posterior; these diagnostics remain applicable when MCMC algorithms other than HMC are used.
mon <- monitor(fit_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.113 4.531 9.895 4.441 3.386 1.017 548 754
tau 0.391 2.846 9.611 3.620 3.098 1.069 67 82
theta[1] -2.238 5.814 16.293 6.226 5.739 1.016 747 1294
theta[2] -2.601 5.067 13.433 5.077 4.860 1.012 970 1240
theta[3] -5.013 4.355 12.111 3.938 5.332 1.009 899 1147
theta[4] -2.858 4.999 12.816 4.893 4.825 1.010 986 1059
theta[5] -4.742 4.030 10.829 3.661 4.808 1.013 715 988
theta[6] -4.155 4.281 11.595 4.078 4.838 1.010 833 976
theta[7] -1.299 5.974 15.589 6.311 5.184 1.019 612 1182
theta[8] -3.371 5.117 13.844 4.975 5.338 1.010 901 1477
lp__ -24.682 -15.041 0.371 -14.005 7.438 1.068 69 89
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
See Appendix F for results of longer chains.
Bulk-ESS and Tail-ESS for the between-school standard deviation tau are 67 and 82, respectively. Both are less than 400, indicating that we should investigate this parameter more carefully. We thus examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates of tau. These plots may show either quantiles or parameter values on the vertical axis. Red ticks show divergent transitions.
plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20)
plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20, rank = FALSE)
We see that the sampler has difficulties exploring small tau values. As the sampling efficiency for estimating small tau values is practically zero, we may be missing a substantial amount of posterior mass and thus obtain biased estimates. The red ticks, which mark iterations with divergences, are concentrated at small tau values, further indicating problems in exploring that region, which is likely to cause bias.
We also examine the sampling efficiency of different quantile estimates. Again, these plots may show either quantiles or parameter values on the vertical axis.
plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40)
plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40, rank = FALSE)
Most of the quantile estimates have worryingly low effective sample size.
Let’s see how the estimated effective sample size changes when we use more and more draws. Here we don’t see sudden changes, but both bulk-ESS and tail-ESS are too low. See Appendix F for results of longer chains.
plot_change_ess(fit = fit_cp, par = "tau")
In line with these findings, the rank plots of tau clearly show problems in the mixing of the chains.
samp_cp <- as.array(fit_cp)
mcmc_hist_r_scale(samp_cp[, , "tau"])
For hierarchical models, the non-centered parameterization often works better than the centered one:
writeLines(readLines("eight_schools_ncp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta_tilde[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_tilde[j];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta_tilde ~ normal(0, 1);
y ~ normal(theta, sigma);
}
For reasons of comparability, we also run the non-centered parameterization model with an increased adapt_delta value:
fit_ncp2 <- stan(
file = 'eight_schools_ncp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
We get zero divergences and no other warnings, which is a good first sign.
mon <- monitor(fit_ncp2)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.136 4.418 9.960 4.466 3.367 1.000 5531 3004
tau 0.299 2.810 9.501 3.588 3.174 1.000 2872 1908
theta_tilde[1] -1.295 0.307 1.880 0.307 0.979 1.003 5046 2874
theta_tilde[2] -1.445 0.109 1.623 0.101 0.937 1.001 4177 2735
theta_tilde[3] -1.644 -0.075 1.481 -0.078 0.966 1.001 6485 2994
theta_tilde[4] -1.512 0.084 1.622 0.064 0.947 1.003 6076 2514
theta_tilde[5] -1.710 -0.173 1.387 -0.162 0.939 1.002 5608 3177
theta_tilde[6] -1.642 -0.060 1.528 -0.065 0.967 1.001 4855 2773
theta_tilde[7] -1.252 0.396 1.867 0.367 0.955 1.001 4796 2849
theta_tilde[8] -1.536 0.072 1.662 0.062 0.964 1.000 6142 2972
theta[1] -1.619 5.639 16.310 6.246 5.576 1.003 4907 3015
theta[2] -2.415 4.819 12.957 5.013 4.680 1.000 5122 3242
theta[3] -4.631 4.261 12.105 4.090 5.270 1.000 5457 3407
theta[4] -2.733 4.754 12.485 4.817 4.780 1.001 4695 3130
theta[5] -3.956 3.820 11.023 3.717 4.623 1.001 5346 3398
theta[6] -3.883 4.271 11.437 4.127 4.953 1.000 5670 3393
theta[7] -0.979 5.880 15.429 6.375 5.101 1.000 4708 3242
theta[8] -3.234 4.780 13.129 4.869 5.175 1.000 4924 3037
lp__ -11.193 -6.600 -3.776 -6.925 2.314 1.001 1641 2344
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
All Rhat < 1.01 and ESS > 400 indicate a much better performance of the non-centered parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates for tau.
plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 20)
Small tau values are still more difficult to explore, but the relative efficiency is in a good range. We may also check this with a finer resolution:
plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 50)
The sampling efficiency for different quantile estimates looks good as well.
plot_quantile_ess(fit = fit_ncp2, par = 2, nalpha = 40)
In line with these findings, the rank plots of tau show no substantial differences between chains.
samp_ncp2 <- as.array(fit_ncp2)
mcmc_hist_r_scale(samp_ncp2[, , 2])
Betancourt, M. (2017) ‘A conceptual introduction to Hamiltonian Monte Carlo’, arXiv preprint arXiv:1701.02434.
Brooks, S. P. and Gelman, A. (1998) ‘General methods for monitoring convergence of iterative simulations’, Journal of Computational and Graphical Statistics, 7(4), pp. 434–455.
Gelman, A. et al. (2013) Bayesian Data Analysis, third edition. CRC Press.
Gelman, A. and Rubin, D. B. (1992) ‘Inference from iterative simulation using multiple sequences’, Statistical Science, 7(4), pp. 457–472.
Geyer, C. J. (1992) ‘Practical Markov chain Monte Carlo’, Statistical Science, 7, pp. 473–483.
Geyer, C. J. (2011) ‘Introduction to Markov chain Monte Carlo’, in Brooks, S. et al. (eds) Handbook of Markov Chain Monte Carlo. CRC Press.
Hoffman, M. D. and Gelman, A. (2014) ‘The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo’, Journal of Machine Learning Research, 15, pp. 1593–1623. Available at: http://jmlr.org/papers/v15/hoffman14a.html.
Stan Development Team (2018a) Bayesian Statistics Using Stan. Stan Development Team. Available at: https://github.com/stan-dev/stan-book.
Stan Development Team (2018b) ‘RStanArm: Bayesian applied regression modeling via Stan. R package version 2.17.4’. Available at: http://mc-stan.org.
Stan Development Team (2018c) ‘RStan: The R interface to Stan. R package version 2.17.3’. Available at: http://mc-stan.org.
Stan Development Team (2018d) ‘Stan modeling language users guide and reference manual. Version 2.18.0’. Available at: http://mc-stan.org.
Stan Development Team (2018e) ‘The Stan core library version 2.18.0’. Available at: http://mc-stan.org.
The following abbreviations are used throughout the appendices:
We will illustrate rank normalization with a few examples. First, we plot histograms and empirical cumulative distribution functions (ECDFs) with respect to the original parameter values (\(\theta\)), scaled ranks (ranks divided by the maximum rank), and rank normalized values (\(z\)). We use scaled ranks to make the plots look similar for different numbers of draws.
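The mapping itself can be sketched in a few lines of base R (a minimal sketch; the fractional offsets follow Blom's formula, and treating them as what plot_ranknorm and monitornew.R use is an assumption):

```r
# Rank normalization: rank all draws, scale the ranks to (0, 1), and map
# them through the standard-normal quantile function.
rank_normalize <- function(theta) {
  S <- length(theta)
  r <- rank(theta, ties.method = "average")  # ranks over all draws
  # the fractional offsets keep z finite for the smallest and largest draws
  qnorm((r - 3 / 8) / (S + 1 / 4))
}

set.seed(1)
theta <- rcauchy(100)
z <- rank_normalize(theta)  # z has standard-normal-like margins
```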
100 draws from Normal(0, 1):
n <- 100
theta <- rnorm(n)
plot_ranknorm(theta, n)
100 draws from Exponential(1):
theta <- rexp(n)
plot_ranknorm(theta, n)
100 draws from Cauchy(0, 1):
theta <- rcauchy(n)
plot_ranknorm(theta, n)
In the above plots, the ECDFs with respect to scaled ranks and rank normalized \(z\)-values look exactly the same for all distributions. In split-\(\widehat{R}\) and effective sample size computations, we rank all draws jointly, but then compare the ranks and ECDFs of the individual split chains. To illustrate the variation between chains, we draw 8 batches of 100 draws each from Normal(0, 1):
n <- 100
m <- 8
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m)
The variation in the ECDF due to the variation in ranks is now also visible in the scaled ranks and rank normalized \(z\)-values of the different batches (chains).
The benefit of rank normalization is more obvious for non-normal distributions such as the Cauchy:
theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m)
Rank normalization makes the subsequent computations well defined and invariant under monotone transformations. This means that we get the same results, for example, whether we use unconstrained or constrained parameterizations in a model.
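A quick sanity check of this invariance (base R; atan is just one convenient monotone bijection):

```r
set.seed(1)
theta <- rcauchy(100)
# Ranks, and hence the rank-normalized values computed from them, are
# unchanged under a strictly increasing transformation of the draws.
stopifnot(identical(rank(theta), rank(atan(theta))))
```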
In Section 3, we defined the empirical CDF (ECDF) for any \(\theta_\alpha\) as \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq\theta_\alpha). \]
For independent draws, the count \(S\bar{I}_\alpha\) is binomially distributed, so the ECDF value at \(\theta_\alpha\) has a \({\rm Beta}(S\bar{I}_\alpha+1,\, S - S\bar{I}_\alpha + 1)\) distribution. Thus we can easily examine the variation of the ECDF for any \(\theta_\alpha\) value from a single chain. If \(\bar{I}_\alpha\) is not very close to \(0\) or \(1\) and \(S\) is large, we can use a normal approximation with the variance of the Beta distribution,
\[ {\rm Var}[p(\theta \leq \theta_\alpha)] = \frac{(S\bar{I}_\alpha+1)(S-S\bar{I}_\alpha+1)}{(S+2)^2(S+3)}. \] We illustrate uncertainty intervals of the Beta distribution and the normal approximation of the ECDF for 100 draws from Normal(0, 1):
n <- 100
m <- 1
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)
Uncertainty intervals of ECDF for draws from Cauchy(0, 1) illustrate again the improved visual clarity in plotting when using scaled ranks:
n <- 100
m <- 1
theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)
The above plots illustrate that the normal approximation is accurate for practical purposes in MCMC diagnostics.
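A standalone numerical check with \(S = 100\) draws, where \(k\) denotes the number of draws not exceeding \(\theta_\alpha\) (here \(k = 30\) is an arbitrary illustrative value):

```r
S <- 100
k <- 30                                        # draws with theta <= theta_alpha
a <- k + 1
b <- S - k + 1
beta_int <- qbeta(c(0.025, 0.975), a, b)       # exact Beta 95% interval
m <- a / (a + b)                               # Beta mean
s <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))   # Beta sd
norm_int <- m + c(-1.96, 1.96) * s             # normal approximation
round(rbind(beta = beta_int, normal = norm_int), 3)
```

The two intervals agree to well within 0.01, consistent with the plots above.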
This part focuses on diagnostics for draws from a normal distribution. To simplify, independent draws are used as a proxy for very efficient MCMC sampling. First, we sample draws from a standard normal distribution; we will discuss the behavior for non-normal distributions later. See Appendix A for the abbreviations used.
All chains are drawn from the same Normal(0, 1) distribution, with the same linear trend added to each chain:
conds <- expand.grid(
iters = c(250, 1000, 4000),
trend = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
trend <- conds[i, "trend"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r <- r + seq(-trend, trend, length.out = iters)
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, trend, rep, rs)
}
res <- bind_rows(res)
If we don’t split the chains, Rhat misses the trends, because all chains still have a similar marginal distribution.
ggplot(data = res, aes(y = Rhat, x = trend)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Rhat without splitting chains')
Split-Rhat can detect trends, even if the marginals of the chains are similar.
ggplot(data = res, aes(y = zsRhat, x = trend)) +
geom_point() + geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: Split-Rhat is useful for detecting non-stationarity (i.e., trends) in the chains. If we use a threshold of \(1.01\), we can detect trends which account for 2% or more of the total marginal variance. If we use a threshold of \(1.1\), we detect trends which account for 30% or more of the total marginal variance.
The effective sample size is based on split-Rhat and the within-chain autocorrelation. We plot the relative efficiency \(R_{\rm eff}=S_{\rm eff}/S\) for easier comparison between different values of \(S\). In the plot below, dashed lines indicate the threshold at which we would consider the effective sample size to be sufficient (i.e., \(S_{\rm eff} > 400\)). Since we plot the relative efficiency instead of the effective sample size itself, this threshold is divided by \(S\), which we compute here as the number of iterations per chain (variable iters) times the number of chains (\(4\)).
ggplot(data = res, aes(y = zsreff, x = trend)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))
Result: Split-Rhat is more sensitive to trends for small sample sizes, but the effective sample size becomes more sensitive for larger sample sizes (as autocorrelations can be estimated more accurately).
Advice: If in doubt, run longer chains for more accurate convergence diagnostics.
Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has a different mean.
conds <- expand.grid(
iters = c(250, 1000, 4000),
shift = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
shift <- conds[i, "shift"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, shift, rep, rs)
}
res <- bind_rows(res)
ggplot(data = res, aes(y = zsRhat, x = shift)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: If we use a threshold of \(1.01\), we can detect shifts with a magnitude of one third or more of the marginal standard deviation. If we use a threshold of \(1.1\), we detect shifts with a magnitude equal to or larger than the marginal standard deviation.
ggplot(data = res, aes(y = zsreff, x = shift)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))
Result: The effective sample size is not as sensitive, but a shift with a magnitude of half the marginal standard deviation or more leads to very low relative efficiency as the total number of draws increases.
Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and a shift of 0.5.
iters = 250
chains = 4
shift = 0.5
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
colnames(r) <- 1:4
mcmc_hist_r_scale(r)
Although Rhat was less than \(1.05\) in this situation, the rank plots clearly show that the first chain behaves differently.
Next, we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has lower marginal variance.
conds <- expand.grid(
iters = c(250, 1000, 4000),
scale = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
scale <- conds[i, "scale"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, scale, rep, rs)
}
res <- bind_rows(res)
We first look at the Rhat estimates:
ggplot(data = res, aes(y = zsRhat, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: Split-Rhat is not able to detect scale differences between chains.
ggplot(data = res, aes(y = zfsRhat, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Folded-split-Rhat')
Result: Folded-split-Rhat focuses on scales and detects the scale differences. If we use a threshold of \(1.01\), we can detect a chain whose standard deviation is less than \(3/4\) of that of the others. If we use a threshold of \(1.1\), we detect a chain whose standard deviation is less than \(1/4\) of that of the others.
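What folding does can be shown directly (a base-R illustration; the zfsRhat computed by monitor_extra additionally applies rank normalization):

```r
set.seed(1)
r <- array(rnorm(1000 * 4), c(1000, 4))
r[, 1] <- r[, 1] * 0.25          # one chain with a much smaller scale
folded <- abs(r - median(r))     # fold the draws around the overall median
# After folding, the scale difference appears as a location difference
# between chains, which split-Rhat can detect.
colMeans(folded)
```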
Next, we look at the effective sample size estimates:
ggplot(data = res, aes(y = zsreff, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))
Result: The bulk effective sample size of the mean does not detect the problem, as it focuses on location differences between chains.
ggplot(data = res, aes(y = tailreff, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 0) +
geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') +
ggtitle('Relative tail-ESS (tailreff)') +
scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))
Result: The tail effective sample size detects the difference in scales as it focuses on the scale of the chains.
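The tail-ESS itself can be sketched as the minimum of the ESS values for the 5% and 95% quantile indicators (a simplified single-chain version with a naive autocorrelation cutoff; monitornew.R implements the multi-chain estimator):

```r
# Naive single-chain ESS with truncation at the first negative
# autocorrelation estimate (crude stand-in for the Geyer-style estimator).
ess_naive <- function(x) {
  S <- length(x)
  rho <- acf(x, lag.max = min(S - 1, 100), plot = FALSE)$acf[-1]
  first_neg <- which(rho < 0)[1]
  if (!is.na(first_neg)) rho <- rho[seq_len(first_neg - 1)]
  S / (1 + 2 * sum(rho))
}

tail_ess <- function(x) {
  q <- quantile(x, c(0.05, 0.95))
  min(ess_naive(as.numeric(x <= q[1])),  # 5% quantile indicator
      ess_naive(as.numeric(x <= q[2])))  # 95% quantile indicator
}
```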
Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and with one chain having a standard deviation of 0.75 as opposed to a standard deviation of 1 for the other chains.
iters = 250
chains = 4
scale = 0.75
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
colnames(r) <- 1:4
mcmc_hist_r_scale(r)
The folded split-Rhat of \(1.06\) already indicates problems, and the rank plots additionally show clearly that the first chain behaves differently.
The classic split-Rhat is based on calculating within- and between-chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e., infinite), the classic split-Rhat is not well justified. In this section, we use the Cauchy distribution as an example of such a distribution. Even in cases where mean and variance are finite, the distribution can be far from Gaussian; especially distributions with very long tails cause instability in variance and autocorrelation estimates. To alleviate these problems, we use split-Rhat on rank normalized draws.
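As a reference for what follows, the rank normalized split-Rhat can be sketched in base R, following the \(B\) and \(W\) formulas from the main text (the fractional rank offsets are an assumption; the full implementation, including folding and the ESS estimators, is in monitornew.R):

```r
# sims: an iterations x chains matrix of draws
split_rhat_rn <- function(sims) {
  S <- length(sims)
  # rank all draws jointly, then map scaled ranks to normal quantiles
  z <- array(qnorm((rank(sims, ties.method = "average") - 3 / 8) / (S + 1 / 4)),
             dim = dim(sims))
  # split each chain into first and second halves
  half <- floor(nrow(z) / 2)
  z <- cbind(z[1:half, , drop = FALSE],
             z[(nrow(z) - half + 1):nrow(z), , drop = FALSE])
  N <- nrow(z)
  B <- N * var(colMeans(z))        # between-chain variance
  W <- mean(apply(z, 2, var))      # within-chain variance
  sqrt(((N - 1) / N * W + B / N) / W)
}
```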
The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution.
We already looked at the nominal Cauchy model with direct parameterization in the main text, but for completeness, we take a closer look using different variants of the diagnostics.
writeLines(readLines("cauchy_nom.stan"))
parameters {
vector[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the nominal model:
fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1233 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Treedepth exceedance and the Bayesian Fraction of Missing Information are diagnostics specific to dynamic HMC (Betancourt, 2017). We get warnings about a very large number of post-warmup transitions that exceeded the maximum treedepth, which is likely due to the very long tails of the Cauchy distribution. All chains also have a low estimated Bayesian fraction of missing information, indicating slow mixing.
Trace plots for the first parameter look wild with occasional large values:
samp <- as.array(fit_nom)
mcmc_trace(samp[, , 1])
Let’s check Rhat and ESS diagnostics.
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
For one parameter, Rhat exceeds the classic threshold of 1.1, and depending on the Rhat variant, a few others also exceed the threshold of 1.01. The rank-normalized split-Rhat has several values over 1.01. Note that the classic split-Rhat is not well defined in this example, because the mean and variance of the Cauchy distribution are not finite.
plot_ess(res)
Both classic and new effective sample size estimates have several very small values, and so the overall sample shouldn’t be trusted.
Result: Effective sample size is more sensitive than (rank-normalized) split-Rhat in detecting problems of slow mixing.
We also check the log posterior value lp__ and find that its effective sample size is worryingly low.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS = ', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 117
cat('lp__: Tail-ESS = ', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 322.59
We can further analyze potential problems using local effective sample size and rank plots. We examine x[36], which has the smallest tail-ESS of 117.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates (see Section Efficiency for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., with \(k=20\)). The small interval efficiency measures the mixing of an indicator function that indicates whether the draws are inside the specific small interval. This gives us a local efficiency measure that does not depend on the shape of the distribution.
plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)
We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show draws that exceeded the maximum treedepth.
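The computation can be sketched as follows. The ESS estimator below is a deliberately crude single-sequence, autocorrelation-based version (the full implementation behind plot_local_ess uses split chains and the multi-chain variance estimates defined earlier), so treat it as an illustration only:

```r
# Crude ESS of the mean of x, truncating the autocorrelation sum at the
# first negative estimate
ess_mean <- function(x) {
  rho <- acf(x, lag.max = min(500, length(x) - 1), plot = FALSE)$acf[-1]
  cut <- which(rho < 0)[1]
  if (is.na(cut)) cut <- length(rho) + 1
  length(x) / (1 + 2 * sum(rho[seq_len(cut - 1)]))
}
# Local efficiency: ESS of the indicator of each small quantile interval,
# each containing roughly 1/k of the draws
local_ess <- function(draws, k = 20) {
  q <- quantile(draws, probs = seq(0, 1, length.out = k + 1))
  sapply(seq_len(k), function(i) {
    ind <- as.numeric(draws > q[i] & draws <= q[i + 1])
    ess_mean(ind)   # mixing of the small-interval indicator
  })
}
```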
An alternative way to examine the effective sample size in different parts of the posterior is to compute the effective sample size for quantiles (see Section Efficiency for quantiles). Each interval contains a specified proportion of the draws, and the efficiency measures the mixing of an indicator function that indicates whether the draws are inside the specific interval.
plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)
We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show draws that exceeded the maximum treedepth.
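The idea can be sketched as follows, again using a deliberately crude single-sequence ESS estimator (the implementation behind plot_quantile_ess uses split chains and more careful autocorrelation estimates):

```r
# ESS for estimating the alpha-quantile: the ESS of the indicator
# I(x <= q_alpha), for a grid of alpha values
quantile_ess <- function(draws, probs = seq(0.05, 0.95, by = 0.05)) {
  sapply(probs, function(a) {
    ind <- as.numeric(draws <= quantile(draws, a))
    rho <- acf(ind, lag.max = min(500, length(ind) - 1), plot = FALSE)$acf[-1]
    cut <- which(rho < 0)[1]
    if (is.na(cut)) cut <- length(rho) + 1
    length(ind) / (1 + 2 * sum(rho[seq_len(cut - 1)]))
  })
}
```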
We can further analyze potential problems using rank plots, from which we clearly see differences between chains.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
We can try to improve the performance by increasing max_treedepth to \(20\):
fit_nom_td20 <- stan(
file = 'cauchy_nom.stan', seed = 7878,
refresh = 0, control = list(max_treedepth = 20)
)
Trace plots for the first parameter still look wild with occasional large values.
samp <- as.array(fit_nom_td20)
mcmc_trace(samp[, , 1])
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
We check the diagnostics for all \(x\).
plot_rhat(res)
All Rhats are below \(1.1\), but many are over \(1.01\). Classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined). The folded rank normalized Rhat shows that there is still more variation in the scale than in the location between different chains.
plot_ess(res)
Some classic effective sample sizes are very small. If we didn’t realize that the variance is infinite, we might try to run longer chains, but with an infinite variance the true relative efficiency (ESS/S) is zero and longer chains won’t help. However, other quantities can be well defined, which is why it is useful to also look at the rank-normalized version as a generic transformation for achieving finite mean and variance. The smallest bulk-ESS is less than 1000, which is not that bad. The smallest median-ESS is larger than 2500, that is, we are able to estimate the median efficiently. However, many tail-ESS values are less than 400, indicating problems for estimating the scale of the posterior.
Result: The rank normalized effective sample size is more stable than classic effective sample size, which is not well defined for the Cauchy distribution.
Result: It is useful to look at both bulk- and tail-ESS.
We also check lp__. Although increasing max_treedepth improved the bulk-ESS of \(x\), the efficiency for lp__ didn’t change.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 240
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 586.95
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 20)
It seems that increasing max_treedepth has not improved the efficiency in the tails much. We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 40)
The rank plot visualisation of x[11], which has the smallest tail-ESS among the \(x\), indicates clear convergence problems.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
The rank plot visualisation of lp__, which has an effective sample size 240, doesn’t look so good either.
mcmc_hist_r_scale(samp[, , "lp__"])
Let’s try running 8 times longer chains.
fit_nom_td20l <- stan(
file = 'cauchy_nom.stan', seed = 7878,
refresh = 0, control = list(max_treedepth = 20),
warmup = 1000, iter = 9000
)
Warning: There were 7 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Trace plots for the first parameter still look wild with occasional large values.
samp <- as.array(fit_nom_td20l)
mcmc_trace(samp[, , 1])
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
Let’s check the diagnostics for all \(x\).
plot_rhat(res)
All Rhats are below \(1.01\). The classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined in this case).
plot_ess(res)
Most classic ESS estimates are close to zero, and running longer chains just made most of them even smaller.
The smallest bulk-ESS values are around 5000, which is not that bad. The smallest median-ESS values are larger than 25000, that is, we are able to estimate the median efficiently. However, the smallest tail-ESS is 919.37, indicating problems for estimating the scale of the posterior.
Result: The rank normalized effective sample size is more stable than classic effective sample size even for very long chains.
Result: It is useful to look at both bulk- and tail-ESS.
We also check lp__. Although increasing the number of iterations improved bulk-ESS of the \(x\), the relative efficiency for lp__ didn’t change.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1289
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1886.54
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 20)
Increasing the chain length did not seem to change the relative efficiency. With more draws from the longer chains we can use a finer resolution for the local efficiency estimates.
plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)
The sampling efficiency far in the tails is worryingly low. This was more difficult to see previously with fewer draws from the tails. We see similar problems in the plot of effective sample size for quantiles.
plot_quantile_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)
Let’s look at the rank plot visualisation of x[39], which has the smallest tail-ESS among the \(x\).
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Increasing the number of iterations couldn’t remove the mixing problems in the tails. The mixing problem is inherent to the nominal parameterization of the Cauchy distribution.
Next, we examine an alternative parameterization and consider the Cauchy distribution as a scale mixture of Gaussian distributions. The model has two parameters per component, and the Cauchy-distributed \(x\) is computed from them. In addition to improved sampling performance, the example illustrates that it matters which quantities the diagnostics are computed for.
writeLines(readLines("cauchy_alt_1.stan"))
parameters {
vector[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a ./ sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
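Before fitting, we can sanity-check the scale-mixture representation numerically in plain R: dividing a standard normal draw by the square root of a Gamma(0.5, 0.5) draw yields a standard Cauchy draw (this is the Student-t with one degree of freedom).

```r
# Numerical check of the scale mixture: x_a / sqrt(x_b) with
# x_a ~ N(0, 1) and x_b ~ Gamma(0.5, 0.5) is Cauchy(0, 1) distributed
set.seed(1)
x_mix <- rnorm(1e5) / sqrt(rgamma(1e5, shape = 0.5, rate = 0.5))
# compare central quantiles (tail quantiles of the Cauchy are very noisy)
round(quantile(x_mix, c(0.25, 0.5, 0.75)), 2)  # close to qcauchy: -1, 0, 1
qcauchy(c(0.25, 0.5, 0.75))
```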
We run the alternative model:
fit_alt1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)
There are no warnings and the sampling is much faster.
samp <- as.array(fit_alt1)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.
plot_ess(res)
Result: Rank normalized ESS estimates have less variation than the classic ones, which are not well defined for the Cauchy distribution.
We check lp__:
res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1310
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1927.69
The relative efficiencies for lp__ are also much better than with the nominal parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 20)
The effective sample size is good in all parts of the posterior. We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 40)
We compare the mean effective sample sizes of the underlying parameters in the new parameterization and the actual \(x\) we are interested in.
res <- monitor_extra(samp[, , 101:150])
res1 <- monitor_extra(samp[, , 1:50])
res2 <- monitor_extra(samp[, , 51:100])
cat('Mean Bulk-ESS for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x = 3629.24
cat('Mean Tail-ESS for x =' , round(mean(res[, 'tailseff']), 2), '\n')
Mean Tail-ESS for x = 2265.23
cat('Mean Bulk-ESS for x_a =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_a = 3956.06
cat('Mean Bulk-ESS for x_b =' , round(mean(res2[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_b = 2761.22
Result: We see that the effective sample size of the interesting \(x\) can be different from the effective sample size of the parameters \(x_a\) and \(x_b\) that we used to compute it.
The rank plot visualisation of x[40], which has the smallest tail-ESS of 1822.64 among the \(x\), looks better than for the nominal parameterization.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Similarly, the rank plot visualisation of lp__, which has a bulk-ESS of 1310 and a tail-ESS of 1927.69, looks better than for the nominal parameterization.
mcmc_hist_r_scale(samp[, , "lp__"])
Another alternative parameterization is obtained by a univariate transformation as shown in the following code (see also the 3rd alternative in Michael Betancourt’s case study).
writeLines(readLines("cauchy_alt_3.stan"))
parameters {
vector<lower=0, upper=1>[50] x_tilde;
}
transformed parameters {
vector[50] x = tan(pi() * (x_tilde - 0.5));
}
model {
// Implicit uniform prior on x_tilde
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
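The transformation in this model is simply the inverse CDF of the Cauchy distribution, so a uniform variate is mapped exactly to a Cauchy variate; this can be verified directly in R:

```r
# Inverse-CDF check: if u ~ Uniform(0, 1), then tan(pi * (u - 0.5))
# has a Cauchy(0, 1) distribution, i.e. it equals qcauchy(u)
u <- c(0.25, 0.5, 0.75)
tan(pi * (u - 0.5))   # -1, 0, 1
qcauchy(u)            # -1, 0, 1
```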
We run the alternative model:
fit_alt3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the nominal model.
samp <- as.array(fit_alt3)
res <- monitor_extra(samp[, , 51:100])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
All Rhats except some folded Rhats are below 1.01. Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.
plot_ess(res)
Result: Rank normalized relative efficiencies have less variation than the classic ones. The relative bulk-ESS and median-ESS (ESS/S) are slightly larger than 1, which is possible for antithetic Markov chains, which have negative autocorrelations at odd lags.
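The possibility of a relative efficiency above 1 can be seen from a toy antithetic chain, here an AR(1) process with a negative coefficient (an illustrative assumption, not what Stan's sampler produces):

```r
# For a stationary AR(1) chain x_t = phi * x_{t-1} + e_t, the lag-t
# autocorrelation is phi^t, so the ESS of the mean is
#   n / (1 + 2 * sum_t phi^t) = n * (1 - phi) / (1 + phi).
# With phi < 0 (antithetic behaviour), this exceeds n:
phi <- -0.5
(1 - phi) / (1 + phi)   # relative efficiency ESS / n = 3
```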
We also take a closer look at the lp__ value:
res <- monitor_extra(samp[, , 101:102])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1494
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1883.75
The effective sample sizes for lp__ are also much better than with the nominal parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 20)
We examine also the sampling efficiency of different quantile estimates.
plot_quantile_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 40)
The effective sample size in tails is worse than for the first alternative parameterization, although it’s still better than for the nominal parameterization.
We compare the mean effective sample sizes of the underlying parameter in the new parameterization and the actual Cauchy-distributed \(x\) we are interested in.
res <- monitor_extra(samp[, , 51:100])
res1 <- monitor_extra(samp[, , 1:50])
cat('Mean bulk-seff for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean bulk-seff for x = 4702.98
cat('Mean tail-seff for x =' , round(mean(res[, 'zfsseff']), 2), '\n')
Mean tail-seff for x = 1602.7
cat('Mean bulk-seff for x_tilde =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean bulk-seff for x_tilde = 4702.98
cat('Mean tail-seff for x_tilde =' , round(mean(res1[, 'zfsseff']), 2), '\n')
Mean tail-seff for x_tilde = 1612.14
The rank plot visualisation of x[5], which has the smallest tail-ESS of 1890.99 among the \(x\), shows good efficiency, similar to the results for lp__.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
mcmc_hist_r_scale(samp[, , "lp__"])
Half-Cauchy priors are common and, for example, in Stan they are usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan automatically samples in the unconstrained log(x) space, which fundamentally changes the geometry.
writeLines(readLines("half_cauchy_nom.stan"))
parameters {
vector<lower=0>[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the half-Cauchy model with nominal parameterization (and positive constraint).
fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)
There are no warnings and the sampling is much faster than for the full Cauchy distribution with nominal parameterization.
samp <- as.array(fit_half_nom)
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the half-Cauchy distribution.
plot_ess(res)
Result: Rank normalized effective sample sizes have less variation than the classic ones. Some relative bulk-ESS and median-ESS values (ESS/S) are larger than 1, which is possible for antithetic Markov chains, which have negative autocorrelations at odd lags.
Due to the <lower=0> constraint, the sampling was done in the log(x) space, and we can also check the performance in that space.
res <- monitor_extra(log(samp[, , 1:50]))
plot_ess(res)
\(\log(x)\) is quite close to Gaussian, and thus the classic effective sample size is also close to the rank normalized ESS. The latter is exactly the same as for the original \(x\), because rank normalization is invariant to monotone transformations.
Result: The rank normalized effective sample size is close to the classic effective sample size for transformations which make the distribution close to Gaussian.
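The invariance of rank normalization to monotone transformations is easy to verify directly:

```r
# Ranks, and hence rank-normalized values and the diagnostics computed from
# them, are unchanged by any strictly increasing transformation, such as
# log() for positive draws
set.seed(1)
x <- rexp(1000)
all(rank(x) == rank(log(x)))   # TRUE
```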
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 20)
The effective sample size is good overall, with only a small dip in the tails. We can also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 40)
The rank plot visualisation of x[32], which has the smallest tail-ESS of 1742 among \(x\), looks good.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
The rank plot visualisation of lp__ reveals some small differences in the scales, but it’s difficult to know whether this small variation from uniform is relevant.
mcmc_hist_r_scale(samp[, , "lp__"])
writeLines(readLines("half_cauchy_alt.stan"))
parameters {
vector<lower=0>[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a .* sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ inv_gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the half-Cauchy model with the alternative parameterization:
fit_half_reparam <- stan(
file = 'half_cauchy_alt.stan', seed = 7878, refresh = 0
)
There are no warnings and the sampling is as fast as for the half-Cauchy model with nominal parameterization.
samp <- as.array(fit_half_reparam)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
plot_ess(res)
Result: The rank normalized relative efficiencies have less variation than the classic ones, which are not well defined for the Cauchy distribution. Based on bulk-ESS and median-ESS, the efficiency for central quantities is much lower, but based on tail-ESS and MAD-ESS, the efficiency in the tails is slightly better than for the half-Cauchy distribution with nominal parameterization. We also see that a parameterization which is good for the full Cauchy distribution is not necessarily good for the half-Cauchy distribution, as the <lower=0> constraint additionally changes the parameterization.
We also check the lp__ values:
res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 977
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1750.46
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 20)
We also examine the effective sample size for different quantile estimates.
plot_quantile_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 40)
The effective sample size near zero is much worse than for the half-Cauchy distribution with nominal parameterization.
The rank plot visualisation of x[20], which has the smallest tail-ESS among the \(x\), reveals deviations from uniformity, which is expected given the lower effective sample size.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
A similar result is obtained when looking at the rank plots of lp__.
mcmc_hist_r_scale(samp[, , "lp__"])
So far, we have run all models in Stan, but we also want to investigate whether similar problems arise in probabilistic programming languages that use samplers other than variants of Hamiltonian Monte Carlo. Thus, we will also fit the Cauchy model with Jags, which uses a dialect of the BUGS language to specify models. Jags uses a clever mix of Gibbs and Metropolis-Hastings sampling. This kind of sampling does not scale well to high-dimensional posteriors with strongly interdependent parameters, but for the relatively simple models discussed in this case study it should work just fine.
The Jags code for the nominal parameterization of the Cauchy distribution looks as follows:
writeLines(readLines("cauchy_nom.bugs"))
model {
for (i in 1:50) {
x[i] ~ dt(0, 1, 1)
}
}
First, we initialize the Jags model for later reuse.
jags_nom <- jags.model(
"cauchy_nom.bugs",
n.chains = 4, n.adapt = 10000
)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 0
Unobserved stochastic nodes: 50
Total graph size: 52
Initializing model
Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.
samp_jags_nom <- coda.samples(
jags_nom, variable.names = "x",
n.iter = 1000
)
samp_jags_nom <- aperm(abind(samp_jags_nom, along = 3), c(1, 3, 2))
dimnames(samp_jags_nom)[[2]] <- paste0("chain:", 1:4)
We summarize the model as follows:
mon <- monitor(samp_jags_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] -6.065 -0.011 6.368 -3.288 164.714 1.001 4374 4001
x[2] -6.634 -0.003 6.245 -0.644 32.410 1.000 4092 3971
x[3] -6.639 0.019 6.721 3.589 229.228 1.000 4092 3763
x[4] -6.355 -0.056 5.638 -0.510 33.838 1.000 4457 3891
x[5] -5.489 -0.020 6.375 0.181 58.609 1.001 4124 3885
x[6] -6.355 -0.006 6.357 0.767 44.227 1.000 3804 3843
x[7] -7.672 -0.034 6.916 0.762 65.339 1.000 4007 4016
x[8] -6.368 0.018 6.494 -0.479 27.321 1.001 4069 3890
x[9] -6.054 -0.009 6.803 1.547 68.922 1.000 3864 3972
x[10] -6.864 0.016 6.747 52.663 3292.059 1.001 3771 3974
x[11] -6.784 0.033 5.830 0.039 35.321 1.001 4204 4102
x[12] -5.745 0.035 5.992 -1.009 168.276 1.001 4076 4012
x[13] -6.211 0.045 6.651 -1.789 63.149 1.000 3795 3837
x[14] -6.108 0.025 6.136 0.725 57.857 1.001 3969 3972
x[15] -5.626 -0.019 6.077 0.657 135.151 1.000 4124 4099
x[16] -7.235 -0.018 6.043 -4.887 188.630 1.000 3865 3665
x[17] -6.014 0.038 5.928 -0.772 61.074 1.001 3836 3613
x[18] -6.192 0.015 5.778 -0.289 32.625 1.000 3971 4025
x[19] -5.833 0.018 6.533 -0.416 34.581 1.000 3693 3663
x[20] -6.336 -0.036 6.053 -0.673 73.409 1.001 4171 4015
x[21] -6.181 -0.030 5.674 -0.697 24.795 1.001 4053 3917
x[22] -5.852 -0.029 5.425 8.866 588.074 1.001 3405 3737
x[23] -5.877 0.002 6.059 -6.604 366.713 1.000 3648 3931
x[24] -6.208 0.018 6.222 1.601 123.503 1.001 3753 4013
x[25] -6.454 -0.007 6.691 4.273 254.640 1.001 4021 3822
x[26] -6.104 0.019 6.641 0.485 46.759 1.001 4117 3803
x[27] -6.657 -0.021 5.997 4.694 189.151 1.000 3658 3834
x[28] -5.738 -0.019 5.979 -0.431 45.203 1.000 4062 3933
x[29] -6.622 -0.001 6.738 0.363 80.103 1.000 3886 4024
x[30] -6.590 0.040 5.683 0.052 27.396 1.000 3686 3271
x[31] -6.247 0.004 5.946 -7.028 490.306 1.001 3382 4001
x[32] -6.491 -0.022 5.783 1.860 58.397 1.001 4101 3919
x[33] -5.972 -0.003 6.440 2.577 106.670 1.000 4088 3889
x[34] -6.417 0.053 7.271 -2.861 127.422 1.000 3823 3932
x[35] -6.484 0.021 5.877 0.576 44.268 1.001 3933 3814
x[36] -6.703 -0.048 5.432 -0.940 50.770 1.000 3519 3826
x[37] -6.403 -0.040 6.134 -1.857 62.055 1.000 3759 3930
x[38] -7.000 -0.004 6.331 0.259 135.908 1.000 4003 3678
x[39] -6.076 -0.007 5.916 -1.360 48.272 0.999 4273 3812
x[40] -6.047 0.008 6.534 -0.570 35.955 1.001 3598 3914
x[41] -5.617 0.033 6.945 -0.508 188.046 1.000 3831 3808
x[42] -6.911 -0.007 6.039 -1.020 27.088 1.001 4323 4095
x[43] -6.754 -0.003 6.726 1.535 101.810 1.001 3890 3513
x[44] -6.265 -0.019 6.944 3.971 159.152 1.001 3765 3890
x[45] -6.004 0.016 6.369 53.339 3338.077 1.000 3746 3648
x[46] -5.547 -0.027 5.845 -17.738 871.215 0.999 4136 3928
x[47] -6.161 0.027 6.389 1.013 326.707 1.000 4005 3785
x[48] -5.728 0.030 7.437 0.731 29.429 1.001 4063 3874
x[49] -6.796 -0.026 5.979 -0.238 48.647 1.000 3847 3852
x[50] -6.013 0.030 6.624 -1.144 46.479 1.001 3671 3752
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[1:50, 'Bulk_ESS'])
The overall results look very promising with Rhats = 1 and ESS values close to the total number of draws of 4000. We take a detailed look at x[31], which has the smallest bulk-ESS of 3382.
We examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates.
plot_local_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 20)
The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.
plot_quantile_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 40)
Rank plots also look rather similar across chains.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp_jags_nom[, , xmin])
Result: Jags seems to be able to sample from the nominal parameterization of the Cauchy distribution just fine.
We continue with our discussion about hierarchical models on the Eight Schools data, which we started in Section Eight Schools. We also analyse the performance of different variants of the diagnostics.
writeLines(readLines("eight_schools_cp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
In the main text, we observed that the centered parameterization of this hierarchical model did not work well with the default MCMC options of Stan plus increased adapt_delta, and so we directly try to fit the model with longer chains.
Low efficiency can sometimes be compensated for with longer chains. Let’s check 10 times longer chains.
fit_cp2 <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 20000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 2335 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp2)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -0.994 4.836 10.339 4.875 3.566 1.048 71 189
tau 0.329 2.809 10.022 3.674 3.307 1.076 45 17
theta[1] -1.362 6.432 16.320 6.760 5.639 1.013 407 9491
theta[2] -2.478 5.472 12.555 5.416 4.804 1.024 153 9429
theta[3] -4.923 4.664 11.525 4.405 5.434 1.030 117 10374
theta[4] -2.886 5.337 12.396 5.233 4.971 1.026 140 9670
theta[5] -4.484 4.317 10.691 4.091 4.937 1.039 89 4758
theta[6] -4.148 4.697 11.274 4.490 5.079 1.028 118 11277
theta[7] -0.883 6.596 15.527 6.829 5.108 1.011 449 11102
theta[8] -3.465 5.411 13.337 5.337 5.493 1.022 172 10408
lp__ -24.943 -14.781 0.216 -13.837 7.586 1.068 50 86
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp2)
print(res)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat
mu 4.875 0.490 3.566 -0.994 4.836 10.339 53 0.001 71 54 71 0.002 1.050
tau 3.674 0.298 3.307 0.329 2.809 10.022 123 0.003 173 35 45 0.001 1.025
theta[1] 6.760 0.219 5.639 -1.362 6.432 16.320 666 0.017 1057 281 407 0.010 1.008
theta[2] 5.416 0.431 4.804 -2.478 5.472 12.555 124 0.003 169 113 153 0.004 1.022
theta[3] 4.405 0.530 5.434 -4.923 4.664 11.525 105 0.003 146 86 117 0.003 1.026
theta[4] 5.233 0.458 4.971 -2.886 5.337 12.396 118 0.003 163 105 140 0.004 1.024
theta[5] 4.091 0.566 4.937 -4.484 4.317 10.691 76 0.002 102 69 89 0.002 1.034
theta[6] 4.490 0.508 5.079 -4.148 4.697 11.274 100 0.002 137 87 118 0.003 1.025
theta[7] 6.829 0.226 5.108 -0.883 6.596 15.527 512 0.013 745 309 449 0.011 1.009
theta[8] 5.337 0.432 5.493 -3.465 5.411 13.337 162 0.004 231 125 172 0.004 1.018
lp__ -13.837 1.321 7.586 -24.943 -14.781 0.216 33 0.001 44 37 50 0.001 1.081
sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff
mu 1.050 1.048 1.048 1.023 152 0.004 189 0.005 174 0.004 174
tau 1.023 1.077 1.076 1.008 1035 0.026 17 0.000 175 0.004 167
theta[1] 1.008 1.012 1.013 1.001 3297 0.082 9491 0.237 177 0.004 268
theta[2] 1.022 1.024 1.024 1.005 1817 0.045 9429 0.236 173 0.004 177
theta[3] 1.025 1.031 1.030 1.008 736 0.018 10374 0.259 168 0.004 172
theta[4] 1.023 1.026 1.026 1.005 1468 0.037 9670 0.242 167 0.004 167
theta[5] 1.035 1.037 1.039 1.012 375 0.009 4758 0.119 170 0.004 178
theta[6] 1.025 1.028 1.028 1.009 644 0.016 11277 0.282 176 0.004 176
theta[7] 1.008 1.011 1.011 1.000 2761 0.069 11102 0.278 179 0.004 852
theta[8] 1.017 1.022 1.022 1.003 2816 0.070 10408 0.260 166 0.004 191
lp__ 1.080 1.071 1.068 1.059 55 0.001 86 0.002 170 0.004 157
madsreff
mu 0.004
tau 0.004
theta[1] 0.007
theta[2] 0.004
theta[3] 0.004
theta[4] 0.004
theta[5] 0.004
theta[6] 0.004
theta[7] 0.021
theta[8] 0.005
lp__ 0.004
We still get a whole bunch of divergent transitions, so it’s clear that the results can’t be trusted even if all other diagnostics were good. Still, it may be worth looking at additional diagnostics to better understand what’s happening.
Some rank-normalized split-Rhats are still larger than \(1.01\). Bulk-ESS for tau and lp__ is around 800, which corresponds to a low relative efficiency of \(1\%\) but is above our recommendation of ESS > 400. In such cases, it is useful to also look at the local efficiency estimates (and the large number of divergences is a clear indication of problems, too).
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small intervals for tau.
plot_local_ess(fit = fit_cp2, par = "tau", nalpha = 50)
We see that the sampling has difficulties exploring small tau values. Since ESS < 400 for the small probability intervals at small tau values, we may suspect that we miss a substantial amount of posterior mass and obtain biased estimates.
We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_cp2, par = "tau", nalpha = 100)
Several quantile estimates have ESS < 400, which raises concerns about convergence problems and potentially significant bias.
Let’s see how Bulk-ESS and Tail-ESS change when we use more and more draws.
plot_change_ess(fit = fit_cp2, par = "tau")
We see that the recommended thresholds Bulk-ESS > 400 and Tail-ESS > 400 are not sufficient to detect the convergence problems in this case, even though the tail quantile estimates are able to detect them.
The rank plot visualisation of tau also shows clear sticking and mixing problems.
samp_cp2 <- as.array(fit_cp2)
mcmc_hist_r_scale(samp_cp2[, , "tau"])
Similar results are obtained for lp__, which is closely connected to tau for this model.
mcmc_hist_r_scale(samp_cp2[, , "lp__"])
We may also examine small interval efficiencies for mu.
plot_local_ess(fit = fit_cp2, par = "mu", nalpha = 50)
There are gaps of poor efficiency, which again indicates problems in the mixing of the chains. However, these problems do not occur in any specific range of values of mu, as was the case for tau. This tells us that the sampler probably does not have problems with mu itself, but more likely with tau or a closely related quantity.
As we observed divergences, we shouldn’t trust any Monte Carlo standard error (MCSE) estimates either, as they are likely biased as well. However, for illustration purposes, we compute the MCSE, tail quantiles, and corresponding effective sample sizes for the medians of mu and tau. Compared to the shorter MCMC run, using 10 times more draws has not reduced the MCSE to one third, as would be expected if the chains mixed well.
round(quantile_mcse(samp_cp2[ , , "mu"], prob = 0.5), 2)
mcse Q05 Q95 Seff
1 0.37 4.22 5.43 173.52
round(quantile_mcse(samp_cp2[ , , "tau"], prob = 0.5), 2)
mcse Q05 Q95 Seff
1 0.27 2.38 3.27 174.86
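The expected one-third reduction comes from the \(1/\sqrt{S}\) scaling of the Monte Carlo error of well-mixing chains: ten times more draws should shrink the MCSE by a factor of \(\sqrt{10}\approx 3.16\). A quick sanity check of this scaling with independent draws (an idealized sketch, unrelated to the quantile_mcse implementation):

```python
import random
import statistics

def median_mcse(s, reps):
    """Empirical SD of the sample median across many replications,
    i.e. the Monte Carlo standard error of the median for iid draws."""
    medians = [
        statistics.median(random.gauss(0, 1) for _ in range(s))
        for _ in range(reps)
    ]
    return statistics.stdev(medians)

random.seed(42)
mcse_small = median_mcse(100, 400)   # S = 100 draws
mcse_large = median_mcse(1000, 400)  # S = 1000 draws: 10x more
print(round(mcse_small / mcse_large, 2))  # close to sqrt(10) for iid draws
```

When the chains mix poorly, as here, the observed reduction is much smaller than this idealized rate.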
For further evidence, let’s run chains 100 times longer than the default. This is not something we would recommend in practice, as it does not solve the problems with divergences, as illustrated below.
fit_cp3 <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 200000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 11699 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp3)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.097 4.371 9.829 4.371 3.326 1.001 18335 30265
tau 0.474 2.944 10.043 3.797 3.205 1.001 2200 769
theta[1] -1.588 5.733 16.380 6.290 5.694 1.000 23832 110854
theta[2] -2.528 4.850 12.763 4.944 4.756 1.000 27789 136002
theta[3] -5.055 4.086 11.880 3.866 5.361 1.000 39355 122761
theta[4] -2.950 4.684 12.644 4.749 4.862 1.000 32607 138545
theta[5] -4.553 3.786 10.766 3.552 4.715 1.000 34479 44492
theta[6] -4.164 4.156 11.618 4.011 4.907 1.000 37000 92227
theta[7] -1.025 5.922 15.636 6.380 5.161 1.000 20685 58049
theta[8] -3.490 4.740 13.507 4.854 5.393 1.000 36212 125498
lp__ -24.983 -15.148 -2.076 -14.584 6.871 1.001 2541 1074
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp3)
print(res)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat
mu 4.371 0.024 3.326 -1.097 4.371 9.829 18435 0.046 18411 18358 18335 0.046 1.000
tau 3.797 0.033 3.205 0.474 2.944 10.043 9355 0.023 9391 2206 2200 0.006 1.000
theta[1] 6.290 0.032 5.694 -1.588 5.733 16.380 30955 0.077 31120 23846 23832 0.060 1.000
theta[2] 4.944 0.026 4.756 -2.528 4.850 12.763 32922 0.082 33013 27715 27789 0.069 1.000
theta[3] 3.866 0.023 5.361 -5.055 4.086 11.880 53609 0.134 53632 39181 39355 0.098 1.000
theta[4] 4.749 0.024 4.862 -2.950 4.684 12.644 39472 0.099 39723 32535 32607 0.082 1.000
theta[5] 3.552 0.023 4.715 -4.553 3.786 10.766 41358 0.103 41819 34256 34479 0.086 1.000
theta[6] 4.011 0.023 4.907 -4.164 4.156 11.618 45877 0.115 46105 36785 37000 0.092 1.000
theta[7] 6.380 0.033 5.161 -1.025 5.922 15.636 24701 0.062 24691 20682 20685 0.052 1.000
theta[8] 4.854 0.024 5.393 -3.490 4.740 13.507 50166 0.125 50212 36407 36212 0.091 1.000
lp__ -14.584 0.140 6.871 -24.983 -15.148 -2.076 2414 0.006 2410 2545 2541 0.006 1.001
sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff
mu 1.001 1.000 1.001 1.000 28848 0.072 30265 0.076 16309 0.041 18276
tau 1.000 1.001 1.001 1.000 32860 0.082 769 0.002 10904 0.027 15361
theta[1] 1.000 1.000 1.000 1.000 40586 0.101 110854 0.277 15333 0.038 18846
theta[2] 1.000 1.000 1.000 1.000 41788 0.104 136002 0.340 15135 0.038 18463
theta[3] 1.000 1.000 1.000 1.000 31386 0.078 122761 0.307 16680 0.042 22054
theta[4] 1.000 1.000 1.000 1.000 38705 0.097 138545 0.346 16525 0.041 19038
theta[5] 1.000 1.000 1.000 1.000 37235 0.093 44492 0.111 16593 0.041 18887
theta[6] 1.000 1.000 1.000 1.000 36568 0.091 92227 0.231 16393 0.041 18180
theta[7] 1.000 1.000 1.000 1.000 35129 0.088 58049 0.145 14827 0.037 17370
theta[8] 1.000 1.000 1.000 1.000 38924 0.097 125498 0.314 15431 0.039 16092
lp__ 1.001 1.001 1.001 1.001 2942 0.007 1074 0.003 10102 0.025 13775
madsreff
mu 0.046
tau 0.038
theta[1] 0.047
theta[2] 0.046
theta[3] 0.055
theta[4] 0.048
theta[5] 0.047
theta[6] 0.045
theta[7] 0.043
theta[8] 0.040
lp__ 0.034
Rhat, Bulk-ESS, and Tail-ESS are not able to detect the problems, although the Tail-ESS for tau is suspiciously low compared to the total number of draws.
plot_local_ess(fit = fit_cp3, par = "tau", nalpha = 100)
plot_quantile_ess(fit = fit_cp3, par = "tau", nalpha = 100)
And the rank plots of tau also show sticking and mixing problems for small values of tau.
samp_cp3 <- as.array(fit_cp3)
mcmc_hist_r_scale(samp_cp3[, , "tau"])
What we do see here is an advantage of rank plots over trace plots: even with 100,000 draws per chain, rank plots don’t get crowded, and the mixing problems of the chains are still easy to see.
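The construction behind such rank plots is simple: pool the draws from all chains, replace each draw by its rank in the pooled sample, and histogram each chain’s ranks; well-mixing chains give near-uniform histograms. A minimal sketch of this computation (an illustration only, not the mcmc_hist_r_scale implementation):

```python
def rank_histograms(chains, nbins):
    """Histogram each chain's ranks within the pooled sample.
    Well-mixing chains give approximately uniform histograms."""
    pooled = sorted(
        (value, c, i)
        for c, chain in enumerate(chains)
        for i, value in enumerate(chain)
    )
    total = len(pooled)
    hists = [[0] * nbins for _ in chains]
    for rank, (_value, c, _i) in enumerate(pooled):
        hists[c][rank * nbins // total] += 1
    return hists

# Two "chains" stuck in disjoint regions: the rank histograms are
# maximally non-uniform, flagging the mixing failure immediately.
stuck = [list(range(100)), list(range(100, 200))]
print(rank_histograms(stuck, 2))  # [[100, 0], [0, 100]]
```

Because ranks are uniformly distributed by construction, this picture stays readable no matter how many draws are pooled, unlike a trace plot.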
With the centered parameterization, the mean estimate tends to get smaller with more draws. With 400,000 draws using the centered parameterization, the mean estimate is 3.77 (se 0.03). With 40,000 draws using the non-centered parameterization, the mean estimate is 3.6 (se 0.02). The difference is more than 8 standard errors. We are able to see the convergence problems in the centered parameterization if we look carefully (or use the divergence diagnostics), but Rhat, Bulk-ESS, Tail-ESS, and Monte Carlo error estimates for the mean can’t be trusted when other diagnostics indicate convergence problems!
When the autocorrelation time is high, it has been common to thin the chains by saving only a small portion of the draws. However, this throws away information that is also useful for convergence diagnostics. With 400,000 iterations per chain, thinning by 200, and 4 chains, we again end up with 4000 draws, as with the default settings.
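One way to see why thinning can hide problems is via the lag-1 autocorrelation: thinning by \(k\) replaces \(\rho_1\) with \(\rho_k\), so a strongly autocorrelated chain looks almost independent after heavy thinning even though no information was gained. A toy sketch with a synthetic AR(1) chain (illustration only, not part of the Stan workflow):

```python
import random

def lag1_corr(x):
    """Lag-1 autocorrelation of a sequence."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / n
    cov = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1)) / n
    return cov / var

# AR(1) chain with strong autocorrelation (true rho_1 = 0.9).
random.seed(7)
x, chain = 0.0, []
for _ in range(200000):
    x = 0.9 * x + random.gauss(0, 1)
    chain.append(x)

thinned = chain[::200]  # keep every 200th draw, as in the run below

# The thinned chain looks nearly independent (rho_200 = 0.9**200 ~ 0),
# so autocorrelation-based diagnostics have little left to detect.
print(round(lag1_corr(chain), 2), round(lag1_corr(thinned), 2))
```

The thinned draws really are nearly independent of each other, but diagnostics computed from them can no longer see the slow mixing of the underlying chain.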
fit_cp4 <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 400000, thin = 200, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 93 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
We observe several divergent transitions, and the estimated Bayesian fraction of missing information is low, both of which indicate convergence problems and potentially biased estimates.
Unfortunately, the thinning causes Rhat and the ESS estimates to miss the problems. The posterior mean is still biased, being more than 3 standard errors away from the estimate obtained with the non-centered parameterization.
monitor(fit_cp4)
Inference for the input samples (4 chains: each with iter = 4e+05; warmup = 2e+05):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -0.909 4.461 9.732 4.405 3.243 1.000 3784 3648
tau 0.462 2.890 10.031 3.750 3.160 1.001 3625 2447
theta[1] -1.664 5.628 16.162 6.240 5.737 1.000 4101 3691
theta[2] -2.166 4.844 12.614 5.042 4.619 1.000 3950 3946
theta[3] -4.544 4.165 11.876 3.976 5.206 1.000 4121 3819
theta[4] -3.017 4.727 12.425 4.747 4.834 1.000 4026 4188
theta[5] -4.383 3.752 10.560 3.553 4.685 1.001 3790 3839
theta[6] -3.762 4.296 11.814 4.182 4.859 1.000 4057 4059
theta[7] -0.958 5.914 15.396 6.340 4.997 1.001 4154 3813
theta[8] -3.545 4.642 13.475 4.783 5.328 1.000 4040 3968
lp__ -25.059 -15.017 -1.638 -14.424 6.988 1.002 3689 2616
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp4)
print(res)
Inference for the input samples (4 chains: each with iter = 4e+05; warmup = 2e+05):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat
mu 4.405 0.053 3.243 -0.909 4.461 9.732 3737 0.934 3779 3741 3784 0.946 1.000
tau 3.750 0.050 3.160 0.462 2.890 10.031 4006 1.002 4005 3625 3625 0.906 1.000
theta[1] 6.240 0.090 5.737 -1.664 5.628 16.162 4064 1.016 4071 4096 4101 1.025 1.000
theta[2] 5.042 0.074 4.619 -2.166 4.844 12.614 3924 0.981 3935 3939 3950 0.988 1.001
theta[3] 3.976 0.081 5.206 -4.544 4.165 11.876 4097 1.024 4104 4115 4121 1.030 1.000
theta[4] 4.747 0.077 4.834 -3.017 4.727 12.425 3961 0.990 4006 3971 4026 1.006 1.000
theta[5] 3.553 0.077 4.685 -4.383 3.752 10.560 3720 0.930 3810 3742 3790 0.948 1.000
theta[6] 4.182 0.077 4.859 -3.762 4.296 11.814 3945 0.986 4028 4010 4057 1.014 1.000
theta[7] 6.340 0.078 4.997 -0.958 5.914 15.396 4118 1.030 4127 4144 4154 1.038 1.000
theta[8] 4.783 0.085 5.328 -3.545 4.642 13.475 3968 0.992 3989 4014 4040 1.010 1.000
lp__ -14.424 0.118 6.988 -25.059 -15.017 -1.638 3505 0.876 3563 3686 3689 0.922 1.000
sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff
mu 1.000 1.000 1.000 1.000 3655 0.914 3648 0.912 4193 1.048 3805
tau 1.001 1.000 1.001 1.000 4095 1.024 2447 0.612 3955 0.989 3815
theta[1] 1.000 1.000 1.000 0.999 4200 1.050 3691 0.923 4115 1.029 3902
theta[2] 1.000 1.001 1.000 0.999 4049 1.012 3946 0.987 3556 0.889 4059
theta[3] 1.000 1.000 1.000 1.000 3810 0.952 3819 0.955 4075 1.019 3810
theta[4] 1.000 1.000 1.000 1.000 3919 0.980 4188 1.047 3496 0.874 3885
theta[5] 1.000 1.000 1.000 1.001 3582 0.896 3839 0.960 3834 0.958 3842
theta[6] 0.999 1.000 0.999 1.000 3882 0.970 4059 1.015 4005 1.001 3820
theta[7] 1.001 1.000 1.001 1.000 4061 1.015 3813 0.953 4259 1.065 3871
theta[8] 1.000 1.000 1.000 0.999 3750 0.938 3968 0.992 4139 1.035 3826
lp__ 1.001 1.000 1.001 1.002 3192 0.798 2616 0.654 3835 0.959 3914
madsreff
mu 0.951
tau 0.954
theta[1] 0.976
theta[2] 1.015
theta[3] 0.952
theta[4] 0.971
theta[5] 0.960
theta[6] 0.955
theta[7] 0.968
theta[8] 0.956
lp__ 0.978
Various diagnostic plots of tau look reasonable as well.
plot_local_ess(fit = fit_cp4, par = "tau", nalpha = 100)
plot_quantile_ess(fit = fit_cp4, par = "tau", nalpha = 100)
plot_change_ess(fit = fit_cp4, par = "tau")
However, the rank plots still seem to show the problem.
samp_cp4 <- as.array(fit_cp4)
mcmc_hist_r_scale(samp_cp4[, , "tau"])
In the following, we want to expand our understanding of the non-centered parameterization of the hierarchical model fit to the eight schools data.
writeLines(readLines("eight_schools_ncp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta_tilde[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_tilde[j];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta_tilde ~ normal(0, 1);
y ~ normal(theta, sigma);
}
In the main text, we have already seen that the non-centered parameterization works better than the centered parameterization, at least when we use an increased adapt_delta value. Let’s see what happens when using Stan’s default MCMC options.
fit_ncp <- stan(
file = 'eight_schools_ncp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0
)
Warning: There were 2 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
We observe a few divergent transitions with the default of adapt_delta=0.8. Let’s analyze the sample.
monitor(fit_ncp)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -0.980 4.412 9.515 4.376 3.240 1.001 4083 2378
tau 0.248 2.768 9.768 3.608 3.156 1.000 2303 1795
theta_tilde[1] -1.310 0.352 1.883 0.322 0.971 1.002 4571 2604
theta_tilde[2] -1.411 0.135 1.643 0.120 0.916 1.000 5771 3078
theta_tilde[3] -1.620 -0.102 1.486 -0.089 0.957 1.002 4966 3054
theta_tilde[4] -1.434 0.033 1.514 0.046 0.905 1.000 5442 2830
theta_tilde[5] -1.668 -0.172 1.353 -0.155 0.913 1.001 4273 3005
theta_tilde[6] -1.635 -0.084 1.478 -0.075 0.945 1.001 5192 2981
theta_tilde[7] -1.248 0.385 1.883 0.362 0.972 1.000 3898 2800
theta_tilde[8] -1.509 0.073 1.676 0.079 0.970 1.000 4848 2863
theta[1] -1.379 5.678 15.842 6.270 5.601 1.000 3790 2549
theta[2] -2.288 4.877 12.826 5.034 4.621 1.001 5002 2920
theta[3] -4.276 4.079 11.901 3.948 5.239 1.000 4001 3036
theta[4] -2.736 4.658 12.088 4.641 4.634 1.000 4699 3063
theta[5] -4.135 3.894 10.453 3.630 4.537 1.000 4310 3184
theta[6] -4.110 4.188 11.303 3.955 4.878 1.000 4965 2806
theta[7] -0.845 5.857 15.176 6.284 4.943 1.001 4599 3296
theta[8] -3.243 4.771 13.518 4.914 5.371 1.000 4461 3288
lp__ -11.062 -6.469 -3.684 -6.814 2.297 1.001 1711 2385
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_ncp)
print(res)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff
mu 4.376 0.051 3.240 -0.980 4.412 9.515 4023 1.006 4036 4070 4083 1.021
tau 3.608 0.061 3.156 0.248 2.768 9.768 2684 0.671 2700 2293 2303 0.576
theta_tilde[1] 0.322 0.014 0.971 -1.310 0.352 1.883 4528 1.132 4566 4532 4571 1.143
theta_tilde[2] 0.120 0.012 0.916 -1.411 0.135 1.643 5742 1.436 5758 5754 5771 1.443
theta_tilde[3] -0.089 0.014 0.957 -1.620 -0.102 1.486 4929 1.232 4974 4919 4966 1.242
theta_tilde[4] 0.046 0.012 0.905 -1.434 0.033 1.514 5370 1.342 5436 5376 5442 1.361
theta_tilde[5] -0.155 0.014 0.913 -1.668 -0.172 1.353 4222 1.056 4269 4226 4273 1.068
theta_tilde[6] -0.075 0.013 0.945 -1.635 -0.084 1.478 5182 1.296 5195 5179 5192 1.298
theta_tilde[7] 0.362 0.016 0.972 -1.248 0.385 1.883 3885 0.971 3888 3893 3898 0.974
theta_tilde[8] 0.079 0.014 0.970 -1.509 0.073 1.676 4838 1.210 4853 4834 4848 1.212
theta[1] 6.270 0.095 5.601 -1.379 5.678 15.842 3504 0.876 3530 3767 3790 0.948
theta[2] 5.034 0.066 4.621 -2.288 4.877 12.826 4892 1.223 4928 4965 5002 1.250
theta[3] 3.948 0.085 5.239 -4.276 4.079 11.901 3796 0.949 3825 3971 4001 1.000
theta[4] 4.641 0.069 4.634 -2.736 4.658 12.088 4554 1.139 4577 4676 4699 1.175
theta[5] 3.630 0.071 4.537 -4.135 3.894 10.453 4126 1.032 4174 4258 4310 1.077
theta[6] 3.955 0.071 4.878 -4.110 4.188 11.303 4726 1.182 4815 4922 4965 1.241
theta[7] 6.284 0.074 4.943 -0.845 5.857 15.176 4416 1.104 4423 4524 4599 1.150
theta[8] 4.914 0.084 5.371 -3.243 4.771 13.518 4046 1.012 4066 4439 4461 1.115
lp__ -6.814 0.056 2.297 -11.062 -6.469 -3.684 1678 0.420 1684 1704 1711 0.428
Rhat sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff
mu 1.000 0.999 1.000 0.999 1.001 1775 0.444 2378 0.595 4237 1.059
tau 1.000 1.000 1.000 1.000 1.000 3132 0.783 1795 0.449 3151 0.788
theta_tilde[1] 1.000 1.000 1.000 1.000 1.002 2141 0.535 2604 0.651 4489 1.122
theta_tilde[2] 1.000 0.999 1.000 0.999 1.000 1984 0.496 3078 0.769 5350 1.337
theta_tilde[3] 1.000 1.000 1.000 1.000 1.002 2136 0.534 3054 0.763 5397 1.349
theta_tilde[4] 1.000 1.000 1.000 1.000 1.000 2008 0.502 2830 0.708 4821 1.205
theta_tilde[5] 1.000 1.000 1.000 1.000 1.001 2263 0.566 3005 0.751 4282 1.070
theta_tilde[6] 1.000 1.000 1.000 1.000 1.001 2078 0.520 2981 0.745 4760 1.190
theta_tilde[7] 1.000 1.000 1.000 1.000 1.000 2260 0.565 2800 0.700 3823 0.956
theta_tilde[8] 1.000 1.000 1.000 1.000 1.000 1905 0.476 2863 0.716 4351 1.088
theta[1] 1.001 1.001 1.000 1.000 1.000 2528 0.632 2549 0.637 4358 1.089
theta[2] 1.000 1.000 1.000 1.000 1.001 2302 0.576 2920 0.730 4229 1.057
theta[3] 1.000 1.000 1.000 1.000 1.000 2553 0.638 3036 0.759 4002 1.000
theta[4] 1.000 0.999 1.000 0.999 1.000 2654 0.664 3063 0.766 4548 1.137
theta[5] 1.000 1.000 1.000 1.000 1.000 2783 0.696 3184 0.796 4523 1.131
theta[6] 1.000 0.999 1.000 0.999 1.000 2666 0.666 2806 0.701 4759 1.190
theta[7] 1.000 1.001 1.000 1.000 1.001 2493 0.623 3296 0.824 4315 1.079
theta[8] 1.000 1.000 1.000 0.999 1.000 2472 0.618 3288 0.822 4400 1.100
lp__ 1.001 1.000 1.001 1.001 1.000 2487 0.622 2385 0.596 1990 0.498
madsseff madsreff
mu 2343 0.586
tau 3171 0.793
theta_tilde[1] 2491 0.623
theta_tilde[2] 2406 0.602
theta_tilde[3] 2447 0.612
theta_tilde[4] 2419 0.605
theta_tilde[5] 2692 0.673
theta_tilde[6] 2412 0.603
theta_tilde[7] 2783 0.696
theta_tilde[8] 2000 0.500
theta[1] 2788 0.697
theta[2] 2542 0.636
theta[3] 2971 0.743
theta[4] 2859 0.715
theta[5] 2689 0.672
theta[6] 3152 0.788
theta[7] 2911 0.728
theta[8] 2644 0.661
lp__ 2796 0.699
All Rhats are close to 1 and the ESS values are good despite a few divergent transitions. The small-interval and quantile plots of tau reveal some sampling problems for small tau values, but these are not nearly as severe as for the centered parameterization.
plot_local_ess(fit = fit_ncp, par = "tau", nalpha = 20)
plot_quantile_ess(fit = fit_ncp, par = "tau", nalpha = 40)
Overall, the non-centered parameterization looks good even with the default setting of adapt_delta, and increasing it to 0.95 gets rid of the last remaining problems. This stands in sharp contrast to what we observed for the centered parameterization, where increasing adapt_delta didn’t help at all. In fact, this is something we observe quite often: a suboptimal parameterization can cause problems that are not solved simply by tuning the sampler. Instead, we have to adjust our model to achieve trustworthy inference.
We will also run the centered and non-centered parameterizations of the eight schools model with JAGS.
The JAGS code for the centered eight schools model looks as follows:
writeLines(readLines("eight_schools_cp.bugs"))
model {
for (j in 1:J) {
sigma_prec[j] <- pow(sigma[j], -2)
theta[j] ~ dnorm(mu, tau_prec)
y[j] ~ dnorm(theta[j], sigma_prec[j])
}
mu ~ dnorm(0, pow(5, -2))
tau ~ dt(0, pow(5, -2), 1)T(0, )
tau_prec <- pow(tau, -2)
}
First, we initialize the JAGS model for later reuse.
jags_cp <- jags.model(
"eight_schools_cp.bugs",
data = eight_schools,
n.chains = 4, n.adapt = 10000
)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 8
Unobserved stochastic nodes: 10
Total graph size: 40
Initializing model
Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.
samp_jags_cp <- coda.samples(
jags_cp, c("theta", "mu", "tau"),
n.iter = 1000
)
samp_jags_cp <- aperm(abind(samp_jags_cp, along = 3), c(1, 3, 2))
Convergence diagnostics indicate problems in the sampling of mu and tau, but also, to a lesser degree, in all other parameters.
mon <- monitor(samp_jags_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -0.671 4.522 9.701 4.590 3.134 1.028 151 315
tau 0.376 2.754 9.189 3.457 2.846 1.054 79 153
theta[1] -1.320 5.575 15.333 6.237 5.237 1.020 406 819
theta[2] -1.855 4.840 12.649 5.142 4.512 1.024 372 1077
theta[3] -4.399 4.365 11.883 4.232 4.975 1.029 267 963
theta[4] -2.646 4.796 12.578 4.961 4.613 1.022 380 1189
theta[5] -4.185 4.101 10.479 3.744 4.473 1.027 193 781
theta[6] -3.587 4.407 11.162 4.275 4.558 1.020 225 822
theta[7] -0.749 5.659 14.801 6.361 4.856 1.018 372 666
theta[8] -2.825 4.819 13.094 5.006 4.945 1.028 441 1073
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
We also see problems in the sampling of tau using various diagnostic plots.
plot_local_ess(samp_jags_cp, par = "tau", nalpha = 20)
plot_quantile_ess(samp_jags_cp, par = "tau", nalpha = 20)
plot_change_ess(samp_jags_cp, par = "tau")
Let’s see what happens if we run 10 times longer chains.
samp_jags_cp <- coda.samples(
jags_cp, c("theta", "mu", "tau"),
n.iter = 10000
)
samp_jags_cp <- aperm(abind(samp_jags_cp, along = 3), c(1, 3, 2))
Convergence looks better now, although the sampling of tau is still not very efficient.
mon <- monitor(samp_jags_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.402 4.211 9.782 4.244 3.377 1.004 1175 1187
tau 0.244 2.854 9.899 3.673 3.204 1.005 653 700
theta[1] -1.880 5.552 16.111 6.083 5.665 1.002 1826 1622
theta[2] -2.750 4.668 12.567 4.797 4.738 1.002 2145 2967
theta[3] -4.968 3.956 11.759 3.749 5.328 1.002 2229 7328
theta[4] -3.151 4.486 12.481 4.589 4.864 1.003 2107 4318
theta[5] -4.431 3.665 10.683 3.474 4.723 1.002 1956 4772
theta[6] -4.076 3.987 11.423 3.877 4.885 1.002 2199 7352
theta[7] -1.355 5.730 15.548 6.229 5.201 1.002 1613 1288
theta[8] -3.516 4.614 13.379 4.747 5.385 1.002 2316 7483
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
The diagnostic plots of quantiles and small intervals tell a similar story.
plot_local_ess(samp_jags_cp, par = "tau", nalpha = 20)
plot_quantile_ess(samp_jags_cp, par = "tau", nalpha = 20)
Notably, however, the increase in the effective sample size of tau is linear in the total number of draws, indicating that convergence for tau may be achieved simply by running longer chains.
plot_change_ess(samp_jags_cp, par = "tau")
Result: Similar to Stan, JAGS also has convergence problems with the centered parameterization of the eight schools model.
The JAGS code for the non-centered eight schools model looks as follows:
writeLines(readLines("eight_schools_ncp.bugs"))
model {
for (j in 1:J) {
sigma_prec[j] <- pow(sigma[j], -2)
theta_tilde[j] ~ dnorm(0, 1)
theta[j] = mu + tau * theta_tilde[j]
y[j] ~ dnorm(theta[j], sigma_prec[j])
}
mu ~ dnorm(0, pow(5, -2))
tau ~ dt(0, pow(5, -2), 1)T(0, )
}
First, we initialize the JAGS model for later reuse.
jags_ncp <- jags.model(
"eight_schools_ncp.bugs",
data = eight_schools,
n.chains = 4, n.adapt = 10000
)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 8
Unobserved stochastic nodes: 10
Total graph size: 55
Initializing model
Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.
samp_jags_ncp <- coda.samples(
jags_ncp, c("theta", "mu", "tau"),
n.iter = 1000
)
samp_jags_ncp <- aperm(abind(samp_jags_ncp, along = 3), c(1, 3, 2))
Convergence diagnostics indicate much better mixing than for the centered eight schools model.
mon <- monitor(samp_jags_ncp)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.030 4.336 9.801 4.373 3.336 1.000 3035 3356
tau 0.276 2.871 10.338 3.837 3.528 1.002 1107 969
theta[1] -1.665 5.839 16.769 6.450 5.823 1.001 2997 2094
theta[2] -2.466 5.009 12.703 5.013 4.737 1.000 3593 3485
theta[3] -5.402 4.120 11.958 3.788 5.513 1.001 3465 2685
theta[4] -3.125 4.728 12.598 4.755 4.919 1.000 4097 3615
theta[5] -4.789 3.894 11.052 3.607 4.945 1.000 3684 3324
theta[6] -4.137 4.208 11.482 4.050 4.851 1.000 3879 3076
theta[7] -1.118 5.962 15.601 6.412 5.217 1.001 2728 2724
theta[8] -3.796 4.801 13.514 4.844 5.390 1.000 4150 3442
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
Specifically, the mixing of tau looks much better, although we still see some problems in the estimation of the larger quantiles.
plot_local_ess(samp_jags_ncp, par = "tau", nalpha = 20)
plot_quantile_ess(samp_jags_ncp, par = "tau", nalpha = 20)
The change in effective sample size is roughly linear, indicating that the remaining convergence problems are likely to be solved by running longer chains.
plot_change_ess(samp_jags_ncp, par = "tau")
Result: Similar to Stan, JAGS samples the non-centered parameterization of the eight schools model much better than the centered parameterization.
We have already seen that with dynamic HMC the effective sample size can be higher than it would be for independent draws. The next example illustrates interesting relative efficiency phenomena caused by the properties of dynamic HMC algorithms.
We sample from a simple 16-dimensional standard normal model.
writeLines(readLines("normal.stan"))
data {
int<lower=1> J;
}
parameters {
vector[J] x;
}
model {
x ~ normal(0, 1);
}
fit_n <- stan(
file = 'normal.stan', data = data.frame(J = 16),
iter = 20000, chains = 4, seed = 483892929, refresh = 0
)
samp <- as.array(fit_n)
monitor(samp)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] -1.657 -0.002 1.647 -0.002 1.003 1 98264 28709
x[2] -1.641 -0.006 1.644 -0.003 0.996 1 95812 29664
x[3] -1.634 0.002 1.620 0.003 0.990 1 98640 28669
x[4] -1.648 0.005 1.663 0.006 1.007 1 97302 29166
x[5] -1.640 -0.001 1.634 0.002 0.999 1 101542 29930
x[6] -1.647 -0.001 1.646 -0.002 1.001 1 96292 28376
x[7] -1.628 0.008 1.629 0.003 0.993 1 96016 29238
x[8] -1.646 -0.007 1.650 -0.005 1.002 1 100375 29893
x[9] -1.640 0.006 1.655 0.004 0.998 1 101141 28621
x[10] -1.624 -0.007 1.630 -0.002 0.990 1 103126 29411
x[11] -1.645 0.006 1.661 0.004 1.000 1 95886 28488
x[12] -1.619 0.005 1.626 0.006 0.993 1 98433 29228
x[13] -1.621 0.010 1.650 0.003 0.990 1 98181 27421
x[14] -1.629 -0.002 1.625 -0.003 0.989 1 97313 27507
x[15] -1.625 0.005 1.643 0.006 0.992 1 95223 29139
x[16] -1.660 0.003 1.645 -0.001 1.006 1 99980 29639
lp__ -13.000 -7.660 -3.923 -7.951 2.790 1 14489 19627
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(samp)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat
x[1] -0.002 0.003 1.003 -1.657 -0.002 1.647 97846 2.446 98426 97687 98264 2.457 1
x[2] -0.003 0.003 0.996 -1.641 -0.006 1.644 95437 2.386 95717 95531 95812 2.395 1
x[3] 0.003 0.003 0.990 -1.634 0.002 1.620 98375 2.459 98703 98313 98640 2.466 1
x[4] 0.006 0.003 1.007 -1.648 0.005 1.663 96642 2.416 97290 96654 97302 2.433 1
x[5] 0.002 0.003 0.999 -1.640 -0.001 1.634 101317 2.533 101514 101344 101542 2.539 1
x[6] -0.002 0.003 1.001 -1.647 -0.001 1.646 96011 2.400 96299 96003 96292 2.407 1
x[7] 0.003 0.003 0.993 -1.628 0.008 1.629 95558 2.389 96082 95492 96016 2.400 1
x[8] -0.005 0.003 1.002 -1.646 -0.007 1.650 99937 2.498 100392 99920 100375 2.509 1
x[9] 0.004 0.003 0.998 -1.640 0.006 1.655 100824 2.521 101134 100831 101141 2.529 1
x[10] -0.002 0.003 0.990 -1.624 -0.007 1.630 102178 2.554 102983 102317 103126 2.578 1
x[11] 0.004 0.003 1.000 -1.645 0.006 1.661 95226 2.381 95863 95250 95886 2.397 1
x[12] 0.006 0.003 0.993 -1.619 0.005 1.626 97828 2.446 98357 97903 98433 2.461 1
x[13] 0.003 0.003 0.990 -1.621 0.010 1.650 97666 2.442 98166 97682 98181 2.455 1
x[14] -0.003 0.003 0.989 -1.629 -0.002 1.625 96805 2.420 97327 96792 97313 2.433 1
x[15] 0.006 0.003 0.992 -1.625 0.005 1.643 94911 2.373 95185 94950 95223 2.381 1
x[16] -0.001 0.003 1.006 -1.660 0.003 1.645 99541 2.489 100111 99413 99980 2.499 1
lp__ -7.951 0.023 2.790 -13.000 -7.660 -3.923 14922 0.373 14934 14480 14489 0.362 1
sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff
x[1] 1 1 1 1 16450 0.411 28709 0.718 82432 2.061 19205
x[2] 1 1 1 1 16462 0.412 29664 0.742 75494 1.887 19208
x[3] 1 1 1 1 16152 0.404 28669 0.717 78630 1.966 18732
x[4] 1 1 1 1 16075 0.402 29166 0.729 81148 2.029 19079
x[5] 1 1 1 1 16785 0.420 29930 0.748 79953 1.999 20116
x[6] 1 1 1 1 16578 0.414 28376 0.709 79018 1.975 19626
x[7] 1 1 1 1 17109 0.428 29238 0.731 81690 2.042 19543
x[8] 1 1 1 1 16400 0.410 29893 0.747 79263 1.982 18828
x[9] 1 1 1 1 15890 0.397 28621 0.716 81119 2.028 18683
x[10] 1 1 1 1 16460 0.412 29411 0.735 76948 1.924 19242
x[11] 1 1 1 1 15969 0.399 28488 0.712 79164 1.979 18329
x[12] 1 1 1 1 15294 0.382 29228 0.731 81841 2.046 18720
x[13] 1 1 1 1 15370 0.384 27421 0.686 80615 2.015 18210
x[14] 1 1 1 1 16452 0.411 27507 0.688 77592 1.940 19050
x[15] 1 1 1 1 16651 0.416 29139 0.728 80406 2.010 19622
x[16] 1 1 1 1 16395 0.410 29639 0.741 82347 2.059 18845
lp__ 1 1 1 1 21484 0.537 19627 0.491 17074 0.427 23603
madsreff
x[1] 0.480
x[2] 0.480
x[3] 0.468
x[4] 0.477
x[5] 0.503
x[6] 0.491
x[7] 0.489
x[8] 0.471
x[9] 0.467
x[10] 0.481
x[11] 0.458
x[12] 0.468
x[13] 0.455
x[14] 0.476
x[15] 0.491
x[16] 0.471
lp__ 0.590
The Bulk-ESS for all \(x\) is larger than \(9.5 \times 10^{4}\). However, the Tail-ESS for all \(x\) is less than \(3.0 \times 10^{4}\). Further, the Bulk-ESS for lp__ is only about \(1.4 \times 10^{4}\).
If we take a look at all the Stan examples in this notebook, we see that the relative efficiency for lp__ is always below 0.5. This is because lp__ correlates strongly with the total energy in HMC, which is updated with a random-walk proposal once per iteration. Thus, lp__ is likely to show some random-walk behavior as well, leading to positive autocorrelation and a small relative efficiency. At the same time, dynamic HMC can create antithetic Markov chains, which have negative autocorrelations at odd lags. This results in a Bulk-ESS greater than the total number of draws \(S\) for some parameters.
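The antithetic effect can be reproduced with a toy Markov chain: an AR(1) process with a negative coefficient has negative autocorrelations at odd lags, and an autocorrelation-based ESS estimate then exceeds the number of draws. The sketch below uses a simplified single-chain version of the paired-autocorrelation (Geyer-style) estimator; the actual monitornew.R estimator additionally combines multiple split chains.

```python
import random

def autocov(x, lag):
    """Biased autocovariance estimate at the given lag."""
    n = len(x)
    m = sum(x) / n
    return sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag)) / n

def ess(x):
    """ESS from paired autocorrelations rho_{2t} + rho_{2t+1},
    truncated at the first non-positive pair (Geyer-style)."""
    n = len(x)
    c0 = autocov(x, 0)
    tau, t = -1.0, 0
    while t + 1 < n:
        pair = (autocov(x, t) + autocov(x, t + 1)) / c0
        if pair <= 0:
            break
        tau += 2.0 * pair
        t += 2
    return n / tau

random.seed(3)
x_pos, x_neg, pos, neg = 0.0, 0.0, [], []
for _ in range(20000):
    x_pos = 0.9 * x_pos + random.gauss(0, 1)   # sticky chain: ESS << n
    x_neg = -0.5 * x_neg + random.gauss(0, 1)  # antithetic chain: ESS > n
    pos.append(x_pos)
    neg.append(x_neg)

print(ess(pos) < len(pos), ess(neg) > len(neg))  # True True
```

For the antithetic chain, consecutive draws overcorrect around the mean, so averages converge faster than for independent draws, and the integrated autocorrelation time drops below 1.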
Let’s check the effective sample size in different parts of the posterior by computing the effective sample size for small interval estimates for x[1].
plot_local_ess(fit_n, par = 1, nalpha = 100)
The relative efficiency of the small-interval probability estimates is close to 1, with a slight drop in the tails. This is a good result, but still far from the efficiency of the bulk, mean, and median estimates. Let’s check the effective sample size for quantiles.
plot_quantile_ess(fit = fit_n, par = 1, nalpha = 100)
Central quantile estimates have higher effective sample size than tail quantile estimates.
The total energy of HMC limits how far into the tails a chain can go within one iteration. The tails of the target have high potential energy, and thus only iterations with high total energy can reach them. This suggests that the random walk in total energy induces a random walk in the variance of \(x\). Let’s check the second moment of \(x\).
samp_x2 <- as.array(fit_n, pars = "x")^2
monitor(samp_x2)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] 0.004 0.458 3.849 1.006 1.438 1 16443 18225
x[2] 0.004 0.444 3.797 0.992 1.416 1 16492 19392
x[3] 0.004 0.446 3.799 0.980 1.385 1 16148 18342
x[4] 0.004 0.453 3.947 1.015 1.462 1 16070 18288
x[5] 0.004 0.453 3.865 0.998 1.415 1 16785 18672
x[6] 0.004 0.452 3.907 1.001 1.424 1 16572 17525
x[7] 0.004 0.452 3.742 0.987 1.388 1 17097 19120
x[8] 0.004 0.465 3.800 1.004 1.417 1 16397 18152
x[9] 0.004 0.448 3.806 0.995 1.408 1 15922 18049
x[10] 0.004 0.444 3.727 0.981 1.389 1 16461 18098
x[11] 0.004 0.458 3.854 1.001 1.410 1 16008 19463
x[12] 0.004 0.449 3.755 0.987 1.407 1 15368 17674
x[13] 0.004 0.444 3.834 0.981 1.381 1 15371 16755
x[14] 0.004 0.447 3.750 0.979 1.375 1 16461 17715
x[15] 0.004 0.451 3.766 0.983 1.384 1 16655 19241
x[16] 0.004 0.468 3.862 1.012 1.411 1 16400 19741
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(samp_x2)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat zRhat
x[1] 1.006 0.012 1.438 0.004 0.458 3.849 14657 0.366 14664 16440 16443 0.411 1 1 1
x[2] 0.992 0.011 1.416 0.004 0.444 3.797 15511 0.388 15518 16488 16492 0.412 1 1 1
x[3] 0.980 0.011 1.385 0.004 0.446 3.799 14684 0.367 14703 16133 16148 0.404 1 1 1
x[4] 1.015 0.012 1.462 0.004 0.453 3.947 14510 0.363 14518 16039 16070 0.402 1 1 1
x[5] 0.998 0.012 1.415 0.004 0.453 3.865 15017 0.375 15029 16766 16785 0.420 1 1 1
x[6] 1.001 0.012 1.424 0.004 0.452 3.907 14368 0.359 14380 16547 16572 0.414 1 1 1
x[7] 0.987 0.011 1.388 0.004 0.452 3.742 15464 0.387 15488 17080 17097 0.427 1 1 1
x[8] 1.004 0.012 1.417 0.004 0.465 3.800 14773 0.369 14766 16397 16397 0.410 1 1 1
x[9] 0.995 0.012 1.408 0.004 0.448 3.806 13439 0.336 13467 15910 15922 0.398 1 1 1
x[10] 0.981 0.011 1.389 0.004 0.444 3.727 14654 0.366 14672 16430 16461 0.412 1 1 1
x[11] 1.001 0.011 1.410 0.004 0.458 3.854 15068 0.377 15098 15997 16008 0.400 1 1 1
x[12] 0.987 0.012 1.407 0.004 0.449 3.755 14215 0.355 14221 15342 15368 0.384 1 1 1
x[13] 0.981 0.012 1.381 0.004 0.444 3.834 13548 0.339 13547 15368 15371 0.384 1 1 1
x[14] 0.979 0.011 1.375 0.004 0.447 3.750 14547 0.364 14565 16418 16461 0.412 1 1 1
x[15] 0.983 0.011 1.384 0.004 0.451 3.766 15417 0.385 15414 16652 16655 0.416 1 1 1
x[16] 1.012 0.011 1.411 0.004 0.468 3.862 15551 0.389 15561 16390 16400 0.410 1 1 1
zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
x[1] 1 1 18445 0.461 18225 0.456 19172 0.479 23268 0.582
x[2] 1 1 19699 0.492 19392 0.485 19309 0.483 24908 0.623
x[3] 1 1 18532 0.463 18342 0.459 18694 0.467 23865 0.597
x[4] 1 1 18983 0.475 18288 0.457 19156 0.479 23907 0.598
x[5] 1 1 19535 0.488 18672 0.467 20102 0.503 25043 0.626
x[6] 1 1 17535 0.438 17525 0.438 19596 0.490 22613 0.565
x[7] 1 1 19019 0.475 19120 0.478 19555 0.489 23336 0.583
x[8] 1 1 18920 0.473 18152 0.454 18816 0.470 24195 0.605
x[9] 1 1 18386 0.460 18049 0.451 18674 0.467 22198 0.555
x[10] 1 1 18570 0.464 18098 0.452 19147 0.479 23900 0.598
x[11] 1 1 19482 0.487 19463 0.487 18393 0.460 24709 0.618
x[12] 1 1 18691 0.467 17674 0.442 18588 0.465 23963 0.599
x[13] 1 1 18506 0.463 16755 0.419 18258 0.456 24132 0.603
x[14] 1 1 18823 0.471 17715 0.443 19062 0.477 23428 0.586
x[15] 1 1 19633 0.491 19241 0.481 19620 0.490 24476 0.612
x[16] 1 1 20083 0.502 19741 0.494 18829 0.471 24441 0.611
The mean of the bulk-ESS for \(x_j^2\) is \(1.629062\times 10^{4}\), which is quite close to the bulk-ESS for lp__. This is not surprising, as the potential energy in the normal model is proportional to \(\sum_{j=1}^J x_j^2\).
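For the \(J\)-dimensional standard normal target this relationship is exact: up to an additive constant,
\[ \mbox{lp\_\_} \,=\, \log p(x) + \mbox{const} \,=\, -\frac{1}{2}\sum_{j=1}^J x_j^2, \]
so lp__ is a deterministic linear function of \(\sum_{j=1}^J x_j^2\), and the two quantities necessarily share the same mixing behavior.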
Let’s check the effective sample size in different parts of the posterior by computing it for small-interval probability estimates of x[1]^2.
plot_local_ess(fit = samp_x2, par = 1, nalpha = 100)
The relative efficiency is mostly a bit below 1, but it drops in the right tail of \(x_1^2\). This is likely because only some iterations have high enough total energy to obtain draws from the high-energy part of the tail. Let’s check the effective sample size for quantiles.
plot_quantile_ess(fit = samp_x2, par = 1, nalpha = 100)
We can see the correlation between lp__ and the magnitude of x[1] in the following plot.
samp <- as.array(fit_n)
qplot(
as.vector(samp[, , "lp__"]),
abs(as.vector(samp[, , "x[1]"]))
) +
labs(x = 'lp__', y = 'x[1]')
Low lp__ values correspond to high energy and large variation in x[1], while high lp__ values correspond to low energy and small variation in x[1]. Finally, \(\sum_{j=1}^J x_j^2\) is perfectly (negatively) correlated with lp__.
qplot(
as.vector(samp[, , "lp__"]),
as.vector(apply(samp[, , 1:16]^2, 1:2, sum))
) +
labs(x = 'lp__', y = 'sum(x^2)')
This shows that even if we get high effective sample size estimates for central quantities (like the mean or median), it is important to look at the relative efficiency of scale and tail quantities as well. The effective sample size of lp__ can also indicate problems with sampling in the tails.
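The pattern seen above, good mixing for central quantities but poor mixing for scale and tail quantities, can be reproduced with a toy simulation. This is not the Stan sampler, just an illustration under the stated assumptions: draws that are uncorrelated in the bulk, but whose scale follows a slowly mixing process, look well mixed by mean-type diagnostics while the second moment remains strongly autocorrelated.

```r
set.seed(2)
S <- 20000
# Log-variance follows a slowly mixing AR(1) (a near-random-walk in scale)
logv <- as.numeric(stats::filter(rnorm(S, sd = 0.1), 0.99, method = "recursive"))
# Draws are conditionally independent given the scale
x <- rnorm(S, sd = exp(logv / 2))

# Lag-1 sample autocorrelation
acf1 <- function(y) cor(y[-1], y[-length(y)])
acf1(x)    # near zero: the draws themselves look well mixed
acf1(x^2)  # clearly positive: the second moment mixes slowly
```

Mean and median diagnostics applied to `x` would report excellent mixing, while any quantity sensitive to the scale, such as `x^2` or tail quantiles, inherits the slow dynamics of `logv`.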
makevars <- file.path(Sys.getenv("HOME"), ".R/Makevars")
if (file.exists(makevars)) {
writeLines(readLines(makevars))
}
CXX14FLAGS=-O3 -Wno-unused-variable -Wno-unused-function
CXX14 = $(BINPREF)g++ -m$(WIN) -std=c++1y
CXX11FLAGS=-O3 -Wno-unused-variable -Wno-unused-function
devtools::session_info("rstan")
- Session info -----------------------------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate German_Germany.1252
ctype German_Germany.1252
tz Europe/Berlin
date 2018-12-19
- Packages ---------------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0)
backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.1)
BH 1.66.0-1 2018-02-13 [1] CRAN (R 3.5.0)
callr 3.1.0 2018-12-10 [1] CRAN (R 3.5.1)
cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
colorspace 1.3-2 2016-12-14 [1] CRAN (R 3.5.0)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.1)
ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.1)
glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.1)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 3.5.0)
gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.0)
inline 0.3.15 2018-05-18 [1] CRAN (R 3.5.1)
labeling 0.3 2014-08-23 [1] CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 [2] CRAN (R 3.5.1)
lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.0)
loo 2.0.0 2018-04-11 [1] CRAN (R 3.5.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0)
MASS 7.3-50 2018-04-30 [2] CRAN (R 3.5.1)
Matrix 1.2-14 2018-04-13 [2] CRAN (R 3.5.1)
matrixStats 0.54.0 2018-07-23 [1] CRAN (R 3.5.1)
mgcv 1.8-26 2018-11-21 [1] CRAN (R 3.5.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.1)
nlme 3.1-137 2018-04-07 [2] CRAN (R 3.5.1)
pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.1)
pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1)
processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.1)
ps 1.2.1 2018-11-06 [1] CRAN (R 3.5.1)
R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.5.0)
Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.1)
RcppEigen 0.3.3.5.0 2018-11-24 [1] CRAN (R 3.5.1)
reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.5.0)
rlang 0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0)
rstan * 2.18.2 2018-11-07 [1] CRAN (R 3.5.1)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.1)
StanHeaders * 2.18.0-1 2018-12-13 [1] CRAN (R 3.5.1)
stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.1)
stringr * 1.3.1 2018-05-10 [1] CRAN (R 3.5.1)
tibble * 1.4.2 2018-01-22 [1] CRAN (R 3.5.0)
utf8 1.1.4 2018-05-24 [1] CRAN (R 3.5.1)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.0)
withr 2.1.2.9000 2018-12-18 [1] Github (jimhester/withr@be57595)
[1] C:/Users/paulb/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library